Struct to JSON in BigQuery Google Analytics

I have a query whose output is shown in the attached screenshot. Here is the query:
#standardSQL
select
visitNumber,
visitId,
fullVisitorId,
hits.customDimensions
from table_a
left join UNNEST(hits) as hits limit 10;
Below is one row. I want the customDimensions output as JSON, in the format shown below.
I tried the TO_JSON_STRING function in BigQuery, but it didn't give the output in that format. I also tried ARRAY and ARRAY_CONCAT but couldn't get the format above. I'd appreciate it if someone could help.

The query below is for BigQuery Standard SQL and can be a good starting point for you to tweak for your specific needs:
#standardSQL
SELECT
visitNumber,
visitId,
fullVisitorId,
(
SELECT CONCAT('[',STRING_AGG(CONCAT('{"',CAST(index AS STRING), '":', '"', IFNULL(value, ''), '"', '}'), ','), ']')
FROM UNNEST(hits.customDimensions)
) AS customDimensions
FROM table_a
LEFT JOIN UNNEST(hits) AS hits
LIMIT 10
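To see what the CONCAT/STRING_AGG expression above produces, here is a small Python sketch that mimics it on one row of customDimensions. The sample index/value pairs and both function names are made up for illustration. It also points at one caveat: hand-built JSON is only valid as long as the values contain no quotes or backslashes, which a JSON library would escape.

```python
import json

def concat_style_json(custom_dimensions):
    """Mimic CONCAT('[', STRING_AGG('{"index":"value"}', ','), ']') from the query.

    Note: for an empty array the SQL expression yields NULL (STRING_AGG over
    zero rows is NULL), whereas this sketch returns '[]'.
    """
    parts = []
    for dim in custom_dimensions:
        value = dim["value"] if dim["value"] is not None else ""  # IFNULL(value, '')
        parts.append('{"' + str(dim["index"]) + '":"' + value + '"}')
    return "[" + ",".join(parts) + "]"

def escaped_json(custom_dimensions):
    """Same shape, but built with json.dumps so special characters are escaped."""
    return json.dumps(
        [{str(d["index"]): d["value"] or ""} for d in custom_dimensions],
        separators=(",", ":"),
    )

# Hypothetical sample row, not taken from the question.
sample = [{"index": 1, "value": "red"}, {"index": 4, "value": None}]
print(concat_style_json(sample))  # [{"1":"red"},{"4":""}]
print(escaped_json(sample))       # [{"1":"red"},{"4":""}]
```

For plain values the two functions agree; they differ only when a value contains characters that JSON requires to be escaped.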

Related

How to query Direct returning visitor in BigQuery

I am trying to figure out, using BigQuery, how many users returned as Direct users after visiting the website as Organic.
This is what I have so far. To get the number of users who came back as Direct after visiting as Organic, I used
organic_user.visitNumber < direct_user.visitNumber
in the WHERE clause.
SELECT
organic_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM
(
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Organic Search'
) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`ga_sessions_*`,
UNNEST(hits) as hits
WHERE
DATE BETWEEN '20190814'
AND '20190911'
AND channelGrouping = 'Direct'
) AS direct_user ON organic_user.fullVisitorId = direct_user.fullVisitorId
WHERE
organic_user.visitNumber < direct_user.visitNumber
GROUP BY
date
ORDER BY
date ASC
Could anyone verify this query is correct?
If not, could you provide a solution for this?
With all the clarifications you provided in the comments, I was able to come up with some adaptations of your original query:
SELECT
direct_user.date,
COUNT (DISTINCT direct_user.fullVisitorId) AS return_direct_user
FROM (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Organic Search' ) AS organic_user
INNER JOIN (
SELECT
date,
fullVisitorId,
visitNumber
FROM
`bigquery-public-data`.google_analytics_sample.`ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
DATE BETWEEN '20161214'
AND '20180911'
AND channelGrouping = 'Direct' ) AS direct_user
ON
organic_user.fullVisitorId = direct_user.fullVisitorId
AND organic_user.visitNumber < direct_user.visitNumber
GROUP BY
direct_user.date
ORDER BY
direct_user.date ASC
Here are some considerations about the changes I made:
1) It was important to specify which subquery's date is used in the GROUP BY. Since we are counting 'Direct' visits per day, it makes sense to count them on the day they happen, so the query groups by direct_user.date.
2) I moved the organic_user.visitNumber < direct_user.visitNumber condition to the JOIN clause. For INNER JOINs it makes no technical difference, but semantically I think it belongs there.
I hope this is helpful to you.
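The join logic in the adapted query can be sketched in plain Python to check it against a few rows by hand. This is only an illustration under assumptions: the function name and the sample sessions are hypothetical, and each dict stands for one session row with the fields the query selects.

```python
from collections import defaultdict

def returning_direct_users(sessions):
    """Count distinct users per date whose Direct visit follows an earlier Organic visit."""
    organic = defaultdict(list)  # fullVisitorId -> visitNumbers of Organic Search sessions
    for s in sessions:
        if s["channelGrouping"] == "Organic Search":
            organic[s["fullVisitorId"]].append(s["visitNumber"])

    users_per_date = defaultdict(set)
    for s in sessions:
        if s["channelGrouping"] != "Direct":
            continue
        # Join condition: organic_user.visitNumber < direct_user.visitNumber
        earlier_organic = any(n < s["visitNumber"] for n in organic[s["fullVisitorId"]])
        if earlier_organic:
            users_per_date[s["date"]].add(s["fullVisitorId"])  # COUNT(DISTINCT ...)
    return {d: len(u) for d, u in sorted(users_per_date.items())}

# Hypothetical sessions: user A goes Organic then Direct, user B goes Direct then Organic.
sessions = [
    {"date": "20170101", "fullVisitorId": "A", "visitNumber": 1, "channelGrouping": "Organic Search"},
    {"date": "20170102", "fullVisitorId": "A", "visitNumber": 2, "channelGrouping": "Direct"},
    {"date": "20170102", "fullVisitorId": "B", "visitNumber": 1, "channelGrouping": "Direct"},
    {"date": "20170103", "fullVisitorId": "B", "visitNumber": 2, "channelGrouping": "Organic Search"},
]
print(returning_direct_users(sessions))  # {'20170102': 1}
```

Only user A is counted, on the date of the Direct visit, which is the behaviour the GROUP BY on direct_user.date is meant to give.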

Joining to landing pages query doubles the sessions per source

I'm trying to query the sum of visits per source from a BigQuery table of Google Analytics data, but I need to filter some sessions out at landing page level. Hence I'm pre-querying visitIDs by landing page and re-joining to session data like so:
#StandardSQL
WITH landingpages AS (
SELECT
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
# filters to be added here
)
SELECT
sessions.trafficSource.source,
SUM(sessions.totals.visits) AS visits
FROM `project.dataset.ga_sessions_*` AS sessions
JOIN
landingpages
ON
landingpages.visitID = sessions.visitID
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
GROUP BY
trafficSource.source
ORDER BY
visits DESC
This roughly doubles the number of sessions per source compared with what GA reports.
Can anyone point out what I've done wrong? (I suspect it is blindingly obvious.)
I've tried examining the output of the first query and can't find anything wrong with it, aside from a very small proportion of duplicated visitIDs. I've also tried various types of JOIN, all to no avail.
When querying GA data in BigQuery, it's imperative to keep in mind that a unique visit is identified by the combination of fullVisitorId and visitID. Only a join on both columns returns a meaningful data set.
Here's what I should have written:
#StandardSQL
WITH landingpages AS (
SELECT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
),
session_data AS (
SELECT
date AS ga_date, trafficSource.source AS source, fullVisitorId, visitID, SUM(totals.visits) AS visits
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20150926' AND '20150926'
AND
totals.visits > 0
GROUP BY ga_date, source, fullVisitorId, visitID
)
SELECT
ga_date, source, SUM(visits) AS Sessions
FROM
landingpages
JOIN
session_data
ON
landingpages.VisitID = session_data.VisitID
AND
landingpages.fullVisitorId = session_data.fullVisitorId
GROUP BY
ga_date, source
ORDER BY
Sessions DESC
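A tiny Python sketch shows one way a join on visitID alone can inflate the totals: when two visitors share a visitID, every landing page row matches every session row with that ID. The helper name and the data are hypothetical and only meant to illustrate the effect.

```python
def join_and_sum_visits(landing_pages, sessions, keys):
    """Inner-join two lists of dicts on the given key columns and sum visits."""
    total = 0
    for lp in landing_pages:
        for s in sessions:
            if all(lp[k] == s[k] for k in keys):
                total += s["visits"]
    return total

# Hypothetical data: two different visitors happen to share the same visitId.
sessions = [
    {"fullVisitorId": "A", "visitId": 1443250000, "visits": 1},
    {"fullVisitorId": "B", "visitId": 1443250000, "visits": 1},
]
landing_pages = [
    {"fullVisitorId": "A", "visitId": 1443250000, "LandingPage": "/home"},
    {"fullVisitorId": "B", "visitId": 1443250000, "LandingPage": "/blog"},
]

print(join_and_sum_visits(landing_pages, sessions, ["visitId"]))                   # 4
print(join_and_sum_visits(landing_pages, sessions, ["fullVisitorId", "visitId"]))  # 2
```

With both keys in the join condition, each landing page row matches exactly one session and the total equals the real number of sessions.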

Need BigQuery SQL query to collect time on page from Google Analytics data

Can anyone help with a BigQuery SQL query to extract the time on page for a specific page from Google Analytics data, please?
For every visitorId who has visited a particular page, I would like the time on page for that page, so that I can calculate the median time on page rather than the mean.
I'm assuming that the visitorId, hits.hitNumber and hits.time fields will be needed, and that the hits.time of the hit where the page was viewed will somehow need to be subtracted from the hits.time of the following hit.
Any help much appreciated.
Try this:
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
nextTime,
(nextTime - hits.time) as timeOnPage
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.time,
LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as nextTime
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
The key to this code is the LEAD() function, which grabs the specified value from the next row in the partition, based on the PARTITION BY and ORDER BY qualifiers.
Hope that helps!
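For readers less familiar with window functions, here is a rough Python equivalent of what LEAD() does in the query above, followed by the median the question asks for. It is a sketch under assumptions: the function name and sample hits are invented, and each dict stands for one PAGE hit with hits.time in milliseconds.

```python
from itertools import groupby
from statistics import median

def time_on_page(hits):
    """Emulate LEAD(time) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY time)."""
    session_key = lambda h: (h["fullVisitorId"], h["visitNumber"])
    ordered = sorted(hits, key=lambda h: (session_key(h), h["time"]))
    rows = []
    for _, session_hits in groupby(ordered, key=session_key):
        session_hits = list(session_hits)
        # Pair each hit with the next one in the same session; the last hit has none.
        for current, following in zip(session_hits, session_hits[1:] + [None]):
            next_time = following["time"] if following else None
            rows.append({
                **current,
                "nextTime": next_time,
                "timeOnPage": None if next_time is None else next_time - current["time"],
            })
    return rows

# Hypothetical hits for one session; times are milliseconds since session start.
sample_hits = [
    {"fullVisitorId": "A", "visitNumber": 1, "pagePath": "/home", "time": 0},
    {"fullVisitorId": "A", "visitNumber": 1, "pagePath": "/blog", "time": 5000},
    {"fullVisitorId": "A", "visitNumber": 1, "pagePath": "/contact", "time": 12000},
]
rows = time_on_page(sample_hits)
durations = [r["timeOnPage"] for r in rows if r["timeOnPage"] is not None]
print(durations)          # [5000, 7000]
print(median(durations))  # 6000.0
```

The last page of the session has no following hit, so its time on page stays null, just as in the SQL result.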
To account for the last page of a session, the query below can be used. BigQuery has no way to calculate the time spent on the last page, so this query returns zero for it instead of null.
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hit_time,
next_pageview,
CASE
WHEN hits.isExit IS NOT NULL THEN last_interaction - hit_time
ELSE next_pageview - hit_time
END
AS time_on_page
FROM(
SELECT
fullVisitorId,
hits.page.hostname,
hits.page.pagePath,
hits.hitNumber,
hits.isExit,
-- hits.time is in milliseconds; convert to seconds
hits.time/1000 as hit_time,
LEAD(hits.time/1000, 1) OVER (PARTITION BY fullVisitorId, visitNumber ORDER BY hits.time ASC) as next_pageview,
MAX(IF(hits.isInteraction = TRUE,hits.time / 1000,0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
FROM [PROJECTID:DATASETID.ga_sessions_YYYYMMDD]
WHERE hits.type = "PAGE"
)
For Google Analytics 4 (GA4), time spent on page can be calculated as follows:
SELECT
user_pseudo_id,
event_timestamp,
ga_session_id,
page_location,
page_title,
next_hit_in_the_same_session,
(next_hit_in_the_same_session - event_timestamp)/1000000 AS time_on_page_in_seconds
FROM (
SELECT
user_pseudo_id,
event_timestamp,
(
SELECT value.int_value FROM
UNNEST(event_params)
WHERE
key = 'ga_session_id') AS ga_session_id,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_location') AS page_location,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'page_title') AS page_title,
-- ga_session_id is only unique per user, so partition by user_pseudo_id as well
LEAD(event_timestamp) OVER (PARTITION BY user_pseudo_id, (SELECT value.int_value FROM UNNEST(event_params)
WHERE
key = 'ga_session_id')
ORDER BY
event_timestamp ASC) AS next_hit_in_the_same_session
FROM
-- Replace table name.
`cloud-search.analytics_298504139.events_20220106` AS tableAlias
WHERE
event_name = 'page_view'
ORDER BY
user_pseudo_id,
ga_session_id,
event_timestamp ASC )

GA Average Time On Page In BigQuery

I'm having trouble working out average time on page from the GA BigQuery export data, and I'm wondering if someone could check whether the code below looks reasonable.
I can't get it to match the figure from the Query Explorer tool.
Is there a way to run the Query Explorer tool on the LondonCycleHelmet data?
Any help much appreciated, thanks.
select
pageviews,
exit_pageviews,
sum_hit_length_seconds,
sum_hit_length_seconds / (pageviews - exit_pageviews) as avg_time_on_page
from
(
select
SUM(hit_length_seconds) as sum_hit_length_seconds,
COUNT(IF(hits.type='PAGE',(CONCAT(session_key,'_',hits.page.hostname,'_',hits.page.pagePath)),NULL)) AS pageviews,
COUNT(IF((next_hit_time is null) or (hits.hitNumber=hits_hitNumber_max),(CONCAT(session_key,'_',hits.page.hostname,'_',hits.page.pagePath)),NULL)) AS exit_pageviews,
from
(
select
*,
(next_hit_time-hits.time)/1000 as hit_length_seconds,
from
(
select
fullVisitorId,
visitId,
visitorId,
hits.type,
hits.time,
hits.hitNumber,
hits.page.hostname,
hits.page.pagePath,
-- create some keys to handle data later
concat(fullVisitorId,"_",string(visitId)) as session_key,
concat(fullVisitorId,"_",string(visitId),"_",string(hits.hitNumber),"_",string(hits.time)) as hit_key,
-- get max and min number of hits for each session
MAX(hits.hitNumber) WITHIN RECORD AS hits_hitNumber_max,
MIN(hits.hitNumber) WITHIN RECORD AS hits_hitNumber_min,
-- get min and max hit times to work out full session length
MAX(hits.time) WITHIN RECORD AS hits_time_max,
MIN(hits.time) WITHIN RECORD AS hits_time_min,
-- get next and previous hit time to be able to work out length of each hit
LAG(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) as previous_hit_time,
LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) as next_hit_time,
from
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
)
)
UPDATE/CLARIFICATION:
I think it's when I look at an individual page across time that it starts going out of whack.
For example, if I run the query below in BigQuery:
select
pageviews,
exit_pageviews,
sum_hit_length_seconds,
sum_hit_length_seconds / (pageviews - exit_pageviews) as avg_time_on_page
from
(
select
SUM(hit_length_seconds) as sum_hit_length_seconds,
COUNT(IF(hits.type='PAGE',(CONCAT(session_key,'_',hits.page.hostname,'_',hits.page.pagePath)),NULL)) AS pageviews,
COUNT(IF((next_hit_time is null) or (hits.hitNumber=hits_hitNumber_max),(CONCAT(session_key,'_',hits.page.hostname,'_',hits.page.pagePath)),NULL)) AS exit_pageviews,
from
(
select
*,
(next_hit_time-hits.time)/1000 as hit_length_seconds,
from
(
select
fullVisitorId,
visitId,
visitorId,
hits.type,
hits.time,
hits.hitNumber,
hits.page.hostname,
hits.page.pagePath,
-- create some keys to handle data later
concat(fullVisitorId,"_",string(visitId)) as session_key,
concat(fullVisitorId,"_",string(visitId),"_",string(hits.hitNumber),"_",string(hits.time)) as hit_key,
-- get max and min number of hits for each session
MAX(hits.hitNumber) WITHIN RECORD AS hits_hitNumber_max,
MIN(hits.hitNumber) WITHIN RECORD AS hits_hitNumber_min,
-- get min and max hit times to work out full session length
MAX(hits.time) WITHIN RECORD AS hits_time_max,
MIN(hits.time) WITHIN RECORD AS hits_time_min,
-- get next and previous hit time to be able to work out length of each hit
LAG(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) as previous_hit_time,
LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) as next_hit_time,
from
[XXX.ga_sessions_20151001],
[XXX.ga_sessions_20151002],
[XXX.ga_sessions_20151003],
where
hits.page.pagePath='/2015/10/01/blah-blah/'
)
)
)
I get:
[
{
"pageviews": "24002",
"exit_pageviews": "22468",
"sum_hit_length_seconds": "455762.1240000001",
"avg_time_on_page": "297.10699087353333"
}
]
But if I look at Query Explorer like this:
I get:
So it looks like pageviews match, but both exits and time on page seem quite different, and I can't figure out why.
Can anyone recreate this example on their own data?
I have a feeling it's to do with how exits and time on page are calculated in GA, but I could not find any examples in the BigQuery GA cookbook of how to calculate time on page or exits.
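The formula used in the outer SELECT can be checked by hand against the result shown above. A minimal sketch, with a hypothetical function name, that reproduces the arithmetic:

```python
def avg_time_on_page(sum_hit_length_seconds, pageviews, exit_pageviews):
    """Average time on page = total time / (pageviews - exit pageviews)."""
    non_exit_pageviews = pageviews - exit_pageviews
    if non_exit_pageviews <= 0:
        return None  # every pageview was an exit, so no duration is measurable
    return sum_hit_length_seconds / non_exit_pageviews

# Figures from the BigQuery result above: 24002 pageviews, 22468 exits.
result = avg_time_on_page(455762.1240000001, 24002, 22468)
print(result)  # roughly 297.1 seconds, matching avg_time_on_page in the output
```

Because only 24002 - 22468 = 1534 pageviews remain in the denominator, a small difference in how exits are counted changes the average considerably.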

How to get the Google Analytics definition of unique page views in Bigquery

https://support.google.com/analytics/answer/1257084?hl=en-GB#pageviews_vs_unique_views
I'm trying to calculate the sum of unique page views per day, as shown in the Google Analytics interface.
How do I get the equivalent using BigQuery?
There are two ways to do this:
1) One, as the linked documentation says, is to combine the full visitor ID with the session ID (visitId) and count the distinct combinations.
SELECT
EXACT_COUNT_DISTINCT(combinedVisitorId)
FROM (
SELECT
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE' )
2) The other is just counting distinct fullVisitorIds
SELECT
EXACT_COUNT_DISTINCT(fullVisitorId)
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
If someone wants to try this out on a sample public dataset, there is a tutorial on how to add the sample dataset.
The other queries didn't match the Unique Pageviews metric in my Google Analytics account, but the following did:
SELECT COUNT(1) as unique_pageviews
FROM (
SELECT
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber,
COUNT(1) as hits
FROM [my_table]
WHERE hits.type='PAGE'
GROUP BY
hits.page.pagePath,
hits.page.pageTitle,
fullVisitorId,
visitNumber
)
For uniquePageviews, you are better off using something like this:
SELECT
date,
SUM(uniquePageviews) AS uniquePageviews
FROM (
SELECT
date,
CONCAT(fullVisitorId,string(VisitId)) AS combinedVisitorId,
EXACT_COUNT_DISTINCT(hits.page.pagePath) AS uniquePageviews
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE
hits.type='PAGE'
GROUP BY 1,2)
GROUP EACH BY 1;
As of 2022, EXACT_COUNT_DISTINCT() appears to be available in legacy SQL only; in standard SQL, COUNT(DISTINCT ...) does the same job.
Also, for me the following combination of fullVisitorId + visitNumber + visitStartTime + hits.page.pagePath was always more precise than the solutions above:
SELECT
SUM(Unique_PageViews)
FROM
(SELECT
COUNT(DISTINCT(CONCAT(fullvisitorid,"-",CAST(visitNumber AS string),"-",CAST(visitStartTime AS string),"-",hits.page.pagePath))) as Unique_PageViews
FROM
`mhsd-bigquery-project.8330566.ga_sessions_*`,
unnest(hits) as hits
WHERE
_table_suffix BETWEEN '20220307'
AND '20220313'
AND hits.type = 'PAGE')
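The idea behind the last query, counting each page at most once per session, can be sketched in Python as a set of (visitor, session, page) tuples. The function name and sample hits are hypothetical, and each dict stands for one unnested hit.

```python
def unique_pageviews(hits):
    """Count distinct (visitor, session, page) combinations among PAGE hits."""
    seen = {
        (h["fullVisitorId"], h["visitNumber"], h["visitStartTime"], h["pagePath"])
        for h in hits
        if h["type"] == "PAGE"
    }
    return len(seen)

# Hypothetical hits: visitor A views /home twice and /blog once in one session,
# and also fires an event; visitor B views /home once.
hits = [
    {"fullVisitorId": "A", "visitNumber": 1, "visitStartTime": 1646611200, "pagePath": "/home", "type": "PAGE"},
    {"fullVisitorId": "A", "visitNumber": 1, "visitStartTime": 1646611200, "pagePath": "/home", "type": "PAGE"},
    {"fullVisitorId": "A", "visitNumber": 1, "visitStartTime": 1646611200, "pagePath": "/blog", "type": "PAGE"},
    {"fullVisitorId": "A", "visitNumber": 1, "visitStartTime": 1646611200, "pagePath": "/blog", "type": "EVENT"},
    {"fullVisitorId": "B", "visitNumber": 1, "visitStartTime": 1646614800, "pagePath": "/home", "type": "PAGE"},
]
print(unique_pageviews(hits))  # 3
```

The repeated view of /home in visitor A's session is counted once, and the EVENT hit is ignored, which is what distinguishes unique pageviews from plain pageviews.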
