I want to calculate the total timeOnSite for all visitors to a website (dividing by 3600 because the raw data stores it in seconds), and then break it down by content_group and a custom variable called content_level.
The problem arises because content_group and content_level are both nested in arrays, while timeOnSite is a totals. variable that gets inflated when it is used in a query that includes an UNNEST. (content_group is a normal hits.-nested variable, while content_level is nested in customDimensions, which is itself nested in hits, making it a second-level nested variable.)
(Will and Thomas C explain well why this problem emerges in the question "Google Analytics Metrics are inflated when extracting hit level data using BigQuery", but I was unable to apply their advice to the totals.timeOnSite metric.)
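To make the inflation concrete, here is a minimal Python sketch (not BigQuery) of what happens when a session-level field is summed after the hits array has been flattened. The session rows and their values are made up for illustration.

```python
# Hypothetical sessions: time_on_site is stored once per session (in seconds),
# while hits is a repeated field.
sessions = [
    {"session_id": "A", "time_on_site": 3600, "hits": ["/home", "/news", "/sport"]},
    {"session_id": "B", "time_on_site": 1800, "hits": ["/home"]},
]

# Correct total: one value per session.
true_hours = sum(s["time_on_site"] for s in sessions) / 3600

# Flattening hits (the equivalent of ", UNNEST(hits)") repeats the
# session-level value once per hit, so a plain SUM over-counts.
flattened = [
    {"session_id": s["session_id"], "time_on_site": s["time_on_site"], "hit": h}
    for s in sessions
    for h in s["hits"]
]
inflated_hours = sum(r["time_on_site"] for r in flattened) / 3600

print(true_hours)      # 1.5
print(inflated_hours)  # 3.5
```

Session A has three hits, so its 3600 seconds are counted three times after flattening.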
#StandardSQL
SELECT
iso_date,
content_group,
content_level,
SUM(sessions) AS sessions,
SUM(sessions2) AS sessions2,
SUM(time_on_site) AS time_on_site
FROM (
SELECT
date AS iso_date,
hits.contentGroup.contentGroup1 AS content_group,
(SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
SUM(totals.visits) AS sessions,
COUNT(DISTINCT CONCAT(cast(visitId AS STRING), fullVisitorId)) AS sessions2,
SUM(totals.timeOnSite)/3600 AS time_on_site
FROM `projectname.123456789.ga_sessions_20170101`,
unnest(hits) AS hits
GROUP BY
iso_date, content_group, content_level
ORDER BY
iso_date, content_group, content_level
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
(I use a subquery because I'm planning on pulling data from several tables using UNION ALL, but I omitted that syntax because I deemed it not relevant to the question.)
Questions:
*Is it possible to make "local unnestings" for both hits. and hits.customDimensions so that it would be possible to use totals.timeOnSite in my query without it being inflated?
*Is it possible to make a workaround for time on site like I've made with sessions and sessions2?
*Is there a third, hidden solution to this problem?
I couldn't fully test this one but it seems to be working against my dataset:
SELECT
DATE,
COUNT(DISTINCT CONCAT(fv, CAST(v AS STRING))) sessions,
AVG(tos) avg_time_on_site,
content_group,
content_level
FROM(
SELECT
date AS date,
fullvisitorid fv,
visitid v,
ARRAY(SELECT DISTINCT contentGroup.contentGroup1 FROM UNNEST(hits)) AS content_group,
ARRAY(SELECT DISTINCT value FROM UNNEST(hits) AS hits, UNNEST(hits.customDimensions) AS custd WHERE index = 51) AS content_level,
totals.timeOnSite / 3600 AS tos
FROM `dataset_id.ga_sessions_20170101`
WHERE totals.timeOnSite IS NOT NULL
)
CROSS JOIN UNNEST(content_group) content_group
LEFT JOIN UNNEST(content_level) content_level
GROUP BY
DATE, content_group, content_level
What I tried to do first is avoid the UNNEST(hits) operation on the entire dataset. Therefore, in the innermost SELECT statement, content_group and content_level are stored as ARRAYs.
In the outer SELECT, I unnested both of those ARRAYs and computed the total sessions and the average time on site while grouping by the desired fields (I used the average here as it seems to make more sense when dealing with time on site, but if you need the sum you can just change the AVG to SUM).
You won't have the problem of repeated timeOnSite in this query because the outer UNNEST(hits) was avoided. When UNNEST(content_group) and UNNEST(content_level) happen, each value inside those ARRAYs gets associated only once with its corresponding time on site, so no duplication happens.
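The two steps described above can be sketched in Python as follows. This is only an illustration of the idea, not the query itself; the session rows, group names and level names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sessions; each hit carries a (content_group, content_level) pair.
sessions = [
    {"id": "A", "tos_hours": 1.0,
     "hits": [("news", "free"), ("news", "free"), ("news", "free")]},
    {"id": "B", "tos_hours": 0.5,
     "hits": [("sport", "paid")]},
]

totals = defaultdict(float)
for s in sessions:
    # Step 1: collapse the hits into arrays of distinct values per session
    # (the ARRAY(SELECT DISTINCT ...) expressions in the inner query).
    groups = sorted({g for g, _ in s["hits"]})
    # "or [None]" mimics the LEFT JOIN: a session without levels keeps one row.
    levels = sorted({l for _, l in s["hits"]}) or [None]
    # Step 2: unnest the two small arrays; every (group, level) pair sees the
    # session's time on site exactly once, however many hits produced it.
    for g in groups:
        for l in levels:
            totals[(g, l)] += s["tos_hours"]

print(dict(totals))  # {('news', 'free'): 1.0, ('sport', 'paid'): 0.5}
```

Note that in this sketch a session with several distinct content groups still contributes its full time to each of them, so the per-group figures can add up to more than the overall total.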
It might seem odd that I'm answering my own question like this, but a contact of mine from outside of Stack Overflow helped me solve this, so it's actually his answer rather than mine.
The problem with session_duration can be solved by using a window function (you can read more about window functions in the BigQuery documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-functions)
#StandardSQL
SELECT
iso_date,
content_group,
content_level,
COUNT(DISTINCT SessionId) AS sessions,
SUM(session_duration) AS session_duration
FROM (
SELECT
date AS iso_date,
hits.contentGroup.contentGroup1 AS content_group,
(SELECT MAX(IF(index=51, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_level,
CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS SessionId,
(LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) - hits.time) / 3600000 AS session_duration
FROM `projectname.123456789.ga_sessions_*`,
unnest(hits) AS hits
WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170131"
AND (SELECT
MAX(IF(index=51, value, NULL))
FROM
UNNEST(hits.customDimensions)
WHERE
value IN ("web", "phone", "tablet")
) IS NOT NULL
-- no GROUP BY or ORDER BY here: SessionId and the window function need one row per hit
)
GROUP BY iso_date, content_group, content_level
ORDER BY iso_date, content_group, content_level
Both LEAD - OVER - PARTITION in the subselect and the subsubselect in the WHERE-clause are required for the window function to work properly.
A more accurate way of calculating sessions is also provided.
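As a rough illustration of what the LEAD ... OVER (PARTITION BY ... ORDER BY hits.time) expression computes, here is a minimal Python sketch. The hit rows are hypothetical, and the code only mimics the window logic: each hit gets the time until the next hit of the same session, and the last hit of a session gets no duration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flattened hits: (session_id, hit time in ms, content_group).
hits = [
    ("A", 0, "news"),
    ("A", 60_000, "sport"),
    ("A", 180_000, "news"),
    ("B", 0, "news"),
]

durations = []
hits_sorted = sorted(hits, key=itemgetter(0, 1))
for session_id, rows in groupby(hits_sorted, key=itemgetter(0)):
    rows = list(rows)
    # Pair every hit with the next hit of the same session (LEAD(hits.time, 1)).
    for current, nxt in zip(rows, rows[1:] + [None]):
        # The last hit has no successor, so its duration is None (NULL in SQL).
        ms = None if nxt is None else nxt[1] - current[1]
        durations.append((session_id, current[2], ms))

# SUM ignores NULLs, so only hits that have a successor contribute.
per_group_hours = {}
for _, group, ms in durations:
    if ms is not None:
        per_group_hours[group] = per_group_hours.get(group, 0) + ms / 3_600_000

print(per_group_hours)
```

Because the time is attributed per hit, it is split across content groups instead of being repeated for each of them. In this sketch the time spent after the last hit of a session is not captured.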
Related
In BigQuery, the following error appears after execution of a scheduled query:
Resources exceeded during query execution: Not enough resources for
query planning - too many subqueries or query is too complex
I admit the query is quite complex, with multiple OVER() clauses that include PARTITION BY and ORDER BY, which are computationally expensive. However, these OVER() clauses are needed to get the desired resulting table. The query processes approximately 50 GB.
The scheduled query queries data over 4 days of Google Analytics-related data.
However, remarkably, when I run the same query manually, it executes without any problems (approx. 35 seconds query time). Even when I manually execute the query over 365 days of GA data, it succeeds. That query processes 4 TB (approx. 280 seconds query time).
Does anyone know why scheduled queries fail in my case while manual queries can be executed without errors? And - given the fact that the scheduling is important - does anyone know if there is a fix so that the scheduled query can be executed without errors?
Basically, it's the query below. Note that I have hidden the input table, which reduces the query length a bit. The input table is just a collection of SELECT queries that merge multiple input tables using UNION ALL.
Note as well that I am trying to connect hits from separate sources, Firebase Analytics (app data) and Universal Analytics (web data), into custom session id's where this is needed, and if this is not needed use the regular visit id's from GA.
SELECT
*,
MAX(device) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS device_new,
IF
(mix_app_web_session = 'mixed',
CONCAT('mixed_',MAX(app_os) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)),
browser) AS browser_new,
MAX(channel) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS channel_new,
IF
(mix_app_web_session = 'mixed',
MAX(app_os) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
os) AS os_new
FROM (
SELECT
*,
IF
(COUNT(DISTINCT webshop) OVER (PARTITION BY country, date, custvisitid_unified) > 1,
'mixed',
'single') AS mix_app_web_session
FROM ( # define whether custvisitid_unified has hits from both app and web
SELECT
*,
IF
(user_id_anonymous_wide IS NULL
AND mix_app_web_user = 'single',
custvisitid,
CONCAT(MAX(custvisitid) OVER (PARTITION BY country, date, user_id_anonymous_wide RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),cust_session_counter)) AS custvisitid_unified
FROM ( # create custvisitid_unified, in which the linked app and web hits have been assigned a unified custom visitid. Only apply a custom visitid to user_id_anonymous_wide = 'mixed' hits (since it is a bit tricky), otherwise just use the regular visitid from GA
SELECT
*,
IF
(COUNT(DISTINCT webshop) OVER (PARTITION BY country, date, user_id_anonymous_wide) > 1,
'mixed',
'single') AS mix_app_web_user,
(COUNT(new_session) OVER (PARTITION BY country, date, user_id_anonymous_wide ORDER BY timestamp_microsec)) + 1 AS cust_session_counter
FROM ( # define session counter
SELECT
*,
IF
((timestamp_microsec-prev_timestamp_microsec) > 2400000000,
'new',
NULL) AS new_session
FROM ( # Where timestamp is greater than 40 mins (actually 30 mins, but some margin is applied to be sure)
SELECT
*,
IF
(user_id_anonymous_wide IS NOT NULL,
LAG(timestamp_microsec,1) OVER (PARTITION BY country, date, user_id_anonymous_wide ORDER BY timestamp_microsec),
NULL) AS prev_timestamp_microsec
FROM ( # define previous timestamp to calculate difference in timestamp between consecutive hits
SELECT
*,
MAX(user_id_anonymous) OVER (PARTITION BY country, date, custvisitid RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS user_id_anonymous_wide,
IF
(webshop = 'appshop',
os,
NULL) AS app_os # user_id_anonymous_wide: define the user_id_anonymous values for all hits within the existing sessionid (initially only in 1 of the hits)
FROM (
# SELECT many dimensions FROM multiple tables (which resulted in 1 table with the use of UNION ALL's
) ))))))
An update: I fixed the issue by splitting the query into two queries.
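For readers following the session-counter steps in the query above (LAG for the previous timestamp, a 'new' flag for gaps above 2400000000 microseconds, then a running COUNT plus 1), here is a minimal Python sketch of the same logic for a single user and day. The timestamps are hypothetical.

```python
GAP_MICROSEC = 2_400_000_000  # 40 minutes, the threshold used in the query

# Hypothetical hit timestamps (microseconds) for one user on one day, sorted.
timestamps = [0, 600_000_000, 3_600_000_000, 3_700_000_000, 7_000_000_000]

session_counter = []
counter = 1
prev = None
for ts in timestamps:
    # LAG(timestamp_microsec) gives the previous hit; a gap above the
    # threshold marks this hit as the start of a new session.
    if prev is not None and ts - prev > GAP_MICROSEC:
        counter += 1
    # Running count of 'new' flags so far, plus 1 (cust_session_counter).
    session_counter.append(counter)
    prev = ts

print(session_counter)  # [1, 1, 2, 2, 3]
```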
I am trying to replicate sessions by a custom dimension in BigQuery so that they match the Google Analytics UI. I am only a few sessions off and I can't figure out how to get an exact match.
My current understanding is that GA breaks sessions at midnight (because its data model relies on processing in day chunks). I tried to take this into account with the code below, but something is not quite right. Does anyone know how to get an exact match?
SELECT
CD12,
SUM(sessions) AS sessions
FROM (
SELECT
CD12,
CASE WHEN hitNumber = first_hit THEN visits ELSE 0 END AS sessions
FROM (
SELECT
fullVisitorId,
visitStartTime,
totals.visits,
hits.hitNumber,
CASE WHEN cd.index = 12 THEN cd.value END AS CD12,
MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
FROM `data-....`,
UNNEST(hits) AS hits,
UNNEST(hits.customDimensions) AS cd
)
)
WHERE CD12 ='0'
GROUP BY
CD12
ORDER BY
sessions DESC
I'm a bit surprised: how come successive sessions from the same user have visitNumber == 1 (it happens with more than one user)? Doesn't visitNumber (the session number for the user) increment with each successive session?
Please see the attached screenshot.
====
SELECT fullvisitorid, visitid, date, visitNumber, hitNumber, type, page.pagePath, isInteraction
FROM `122623284.ga_sessions_2017*` ga_sessions, unnest(hits) as ht
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
ORDER BY fullvisitorid, visitid, hitnumber
Thanks in advance if anyone has any idea under what scenarios this can happen.
cheers!
UPDATE (after @WillianFuks' response)
It's still the same after re-running the query that @WillianFuks suggested.
The observation here is the stark date difference between the successive visits:
188 days (red)
210 days (green)
184 days (blue)
Analytics looks back for the last session to increment the visitNumber count, but there is a limit on the number of days it looks back, called the lookback window. I don't remember the exact value for Analytics, but the lookback window generally ranges from 90 to 180 days for various Google products.
Since it is not able to find the previous visit within the lookback window, it resets the visitNumber to 1 again.
Update: By default it is 6 months for Google Analytics.
As Elliott suggested in his comment, the problem is most likely due to the duplication that happens when you apply UNNEST to the hits field.
You can confirm that by running this query:
SELECT
fullvisitorid fv,
visitid,
date,
visitNumber,
ARRAY(SELECT AS STRUCT hitNumber, type, page.pagePath AS pagePath, isInteraction FROM UNNEST(hits)) data
FROM `122623284.ga_sessions_2017*`
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
LIMIT 1000
This will bring the fields inside hits without making the cross product (unnest operation) with the outer fields.
Background
In BigQuery, I'm trying to find the number of visitors that both visit one of two pages and purchase a specific product.
When I run each of the sub-queries, the numbers match exactly what I see in Google Analytics.
However, when I join them, the number is different than what I see in GA. I've had someone bring the results of the two sub-queries into Excel and do the equivalent, and their results equal what I'm seeing in BQ.
Details
Here's the query:
SELECT
ProductSessions.date AS date,
SUM(ProductTransactions.totalTransactions) transactions,
COUNT(ProductSessions.visitId) visited_product_sessions
FROM (
SELECT
visitId, date
FROM
`103554833.ga_sessions_20170219`
WHERE
EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE REGEXP_CONTAINS(h.page.pagePath, r"^www.domain.com/(product|product2).html.*"))
GROUP BY visitID, date)
AS ProductSessions
LEFT JOIN (
SELECT
totals.transactions as totalTransactions,
visitId,
date
FROM
`103554833.ga_sessions_20170219`
WHERE
totals.transactions IS NOT NULL
AND EXISTS(
SELECT 1
FROM
UNNEST(hits) h,
UNNEST(h.product) prod
WHERE REGEXP_CONTAINS(prod.v2ProductName, r"^Product®$"))
GROUP BY
visitId, totals.transactions,
date) AS ProductTransactions
ON
ProductTransactions.visitId = ProductSessions.visitId
WHERE ProductTransactions.visitId is not null
GROUP BY
date
ORDER BY
date ASC
I'm expecting ProductTransactions.totalTransactions to replicate the number of transactions in Google Analytics when filtered with an advanced segment of both:
Sessions include Page matching RegEx: www.domain.com/(product|product2).html.*
Sessions include Product matches exactly: Product®
However, results in BQ are about 20% higher than in GA.
Why the difference?
I'm just learning BigQuery so this might be a dumb question, but we want to get some statistics there and one of those is the total sessions in a given day.
To do so, I've queried in BQ:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
group each by fullvisitorid
)
(I'm using the table_query because later on we might increase the range of days)
This results in 1,075,137.
But in our Google Analytics Reports, in the "Audience Overview" section, the same day results:
This report is based on 1,026,641 sessions (100% of sessions).
There's always a difference of roughly 5%, regardless of the day. So I'm wondering: even though the query is quite simple, is there any mistake we've made?
Is this difference expected to happen? I read through BigQuery's documentation but couldn't find anything on this issue.
Thanks in advance,
standardsql
Simply use SUM(totals.visits), or, when using COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))), make sure to filter on totals.visits = 1!
If you use visitId and you are not grouping per day, you will combine midnight-split sessions!
Here are all scenarios:
SELECT
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) allSessionsUniquePerDay,
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING) )) allSessionsUniquePerSelectedTimeframe,
sum(totals.visits) interactiveSessionsUniquePerDay, -- equals GA UI sessions
COUNT(DISTINCT IF(totals.visits=1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL) ) interactiveSessionsUniquePerSelectedTimeframe,
SUM(IF(totals.visits=1,0,1)) nonInteractiveSessions
FROM
`project.dataset.ga_sessions_2017102*`
Wrap up:
fullVisitorId + visitId: useful to reconnect midnight-splits
fullVisitorId + visitStartTime: useful to take splits into account
totals.visits=1 for interaction sessions
fullVisitorId + visitStartTime where totals.visits=1: GA UI sessions (in case you need a session id)
SUM(totals.visits): simple GA UI sessions
fullVisitorId + visitId where totals.visits=1 and GROUP BY date: GA UI sessions with too many chances for errors and misunderstandings
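The scenarios in the wrap-up can be illustrated with a small Python sketch. The rows below are hypothetical stand-ins for ga_sessions_* rows, and the sketch assumes, as the list above states, that a midnight-split session keeps its visitId but gets a new visitStartTime.

```python
# Hypothetical session rows; "visits" stands for totals.visits.
rows = [
    # One session that crosses midnight: same visitId, two visitStartTimes.
    {"fullVisitorId": "u1", "visitId": 100, "visitStartTime": 100, "visits": 1},
    {"fullVisitorId": "u1", "visitId": 100, "visitStartTime": 500, "visits": 1},
    # A session without interaction: totals.visits is NULL.
    {"fullVisitorId": "u2", "visitId": 200, "visitStartTime": 200, "visits": None},
    # An ordinary session.
    {"fullVisitorId": "u3", "visitId": 300, "visitStartTime": 300, "visits": 1},
]

# fullVisitorId + visitStartTime: counts both halves of the split session.
all_by_start_time = len({(r["fullVisitorId"], r["visitStartTime"]) for r in rows})
# fullVisitorId + visitId: reconnects the midnight split.
all_by_visit_id = len({(r["fullVisitorId"], r["visitId"]) for r in rows})
# SUM(totals.visits): NULLs are ignored.
sum_visits = sum(r["visits"] or 0 for r in rows)
# fullVisitorId + visitId where totals.visits = 1.
interactive_by_visit_id = len(
    {(r["fullVisitorId"], r["visitId"]) for r in rows if r["visits"] == 1}
)
non_interactive = sum(1 for r in rows if r["visits"] != 1)

print(all_by_start_time, all_by_visit_id, sum_visits,
      interactive_by_visit_id, non_interactive)  # 4 3 3 2 1
```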
After posting the question we got into contact with Google support and found that in Google Analytics only sessions that had an "event" being fired are actually counted.
In BigQuery you will find all sessions, regardless of whether they had an interaction or not.
In order to find the same result as in GA, you should filter by sessions with totals.visits = 1 in your BQ query (totals.visits is 1 only for sessions that had an event being fired).
That is:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
where totals.visits = 1
group each by fullvisitorid
)
The problem could be due to "COUNT DISTINCT".
According to this post:
COUNT DISTINCT is a statistical approximation for all results greater than 1000
You could try setting an additional COUNT parameter to improve accuracy at the expense of performance (see post), but I would first try:
SELECT COUNT( CONCAT( fullvisitorid,'_', STRING(visitid))) as sessions
from (table_query([40663402], 'timestamp(right(table_id,8)) between
timestamp("20150519") and timestamp("20150519")'))
What worked for me was this:
SELECT count(distinct sessionId) FROM(
SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*`
WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1)
The explanation (very well written in this article: https://adswerve.com/blog/google-analytics-bigquery-tips-users-sessions-part-one/) is that we should be careful when counting and dealing with sessions because, by default, Google Analytics breaks sessions that carry over midnight (in the time zone of the view). The same session can therefore end up in two daily tables, as an image in that article illustrates.
The code provided creates a sessionID by combining:
client id + visit number + date
while acknowledging the session break; the result will be in a human-readable format. Finally to match sessions in the Google Analytics UI, make sure to filter to only those with totals.visits = 1.