I'm a bit surprised: how come successive sessions from the same user have visitNumber == 1 (it happens with more than one user)? Doesn't visitNumber (the session number for the user) increment with each successive session?
Please see the attached screenshot.
====
SELECT fullvisitorid, visitid, date, visitNumber, hitNumber, type, page.pagePath, isInteraction
FROM `122623284.ga_sessions_2017*` ga_sessions, unnest(hits) as ht
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
ORDER BY fullvisitorid, visitid, hitnumber
Thanks in advance if anyone has any idea under what scenarios this can happen.
cheers!
UPDATE (after #WillianFuks response)
It's still the same after re-running the query that #WillianFuks suggested.
The observation here is the stark date difference between the successive visits:
188 days (red)
210 days (green)
184 days (blue)
Analytics looks back for the user's last session in order to increment the visitNumber count, but there is a limit on the number of days it looks back, called the lookback window. I don't remember the exact value for Analytics, but the lookback window generally ranges from 90 to 180 days for various Google products.
Since it is not able to find the previous visit within the lookback window, it resets the visitNumber to 1 again.
Update: By default it is 6 months for Google Analytics.
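The reset described above can be sketched in a few lines of Python. This is only an illustration of the explanation in this answer, not Google's actual implementation: the function name `visit_numbers`, the 180-day value of `LOOKBACK_DAYS`, and the sample dates are all assumptions chosen so that the last gap is 188 days, like the "red" case in the question.

```python
from datetime import date

# Assumed lookback window; the real value is a Google Analytics setting.
LOOKBACK_DAYS = 180


def visit_numbers(session_dates, lookback_days=LOOKBACK_DAYS):
    """Return the visitNumber each session would get under the lookback rule."""
    numbers = []
    previous = None
    for current in sorted(session_dates):
        if previous is not None and (current - previous).days <= lookback_days:
            # Previous session found inside the window: keep counting up.
            numbers.append(numbers[-1] + 1)
        else:
            # First session ever, or previous session is too old: restart at 1.
            numbers.append(1)
        previous = current
    return numbers


# Hypothetical user: two sessions in January, then one 188 days later.
sessions = [date(2017, 1, 5), date(2017, 1, 20), date(2017, 7, 27)]
print(visit_numbers(sessions))
```

Under these assumptions the third session falls outside the window, so it is numbered 1 again even though the user has been seen before.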
As Elliott suggested in his comment, the problem is most likely due to the duplication that happens when you apply UNNEST to the hits field.
You can confirm that by running this query:
SELECT
fullvisitorid fv,
visitid,
date,
visitNumber,
ARRAY(SELECT AS STRUCT hitNumber, type, page.pagePath AS pagePath, isInteraction FROM UNNEST(hits)) data
FROM `122623284.ga_sessions_2017*`
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
LIMIT 1000
This will bring the fields inside hits without making the cross product (unnest operation) with the outer fields.
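To make the duplication concrete, here is a small Python sketch that mimics what the cross join with UNNEST(hits) does to session-level fields. The rows are made-up sample data shaped loosely like the export, and `flatten` is a hypothetical helper, not a BigQuery function.

```python
# Hypothetical nested rows: one row per session, hits kept as a nested list.
sessions = [
    {"fullVisitorId": "A", "visitId": 1001, "visitNumber": 1,
     "hits": [{"hitNumber": 1, "type": "PAGE"},
              {"hitNumber": 2, "type": "EVENT"},
              {"hitNumber": 3, "type": "PAGE"}]},
    {"fullVisitorId": "A", "visitId": 1002, "visitNumber": 2,
     "hits": [{"hitNumber": 1, "type": "PAGE"}]},
]


def flatten(rows):
    """Mimic `FROM sessions, UNNEST(hits)`: one output row per (session, hit)."""
    return [
        {"fullVisitorId": s["fullVisitorId"], "visitId": s["visitId"],
         "visitNumber": s["visitNumber"], **hit}
        for s in rows
        for hit in s["hits"]
    ]


flat = flatten(sessions)
# Session-level fields are repeated once per hit after flattening.
repeats = sum(1 for r in flat if r["visitNumber"] == 1)
print(len(sessions), len(flat), repeats)
```

The nested form has two rows, while the flattened form has four, and visitNumber 1 shows up three times although it belongs to a single session.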
Related
I am trying to mimic this chart in GA:
But I have noticed that when I don't add date to my code the numbers match, whereas when I add date the numbers seem to double.
Code:
SELECT
date,
COUNT(DISTINCT fullVisitorId) AS Users,
-- New Users (metric)
COUNT(DISTINCT(
CASE
WHEN totals.newVisits = 1 THEN fullVisitorId
ELSE
NULL
END)) AS New_Users,
-- Sessions (metric)
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS Sessions,
-- Bounces (metric)
COUNT(DISTINCT
CASE
WHEN totals.bounces = 1 THEN CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))
ELSE
NULL
END
) AS Bounces,
-- Transactions (metric)
COUNT(DISTINCT hits.transaction.transactionId) AS Transactions,
--Revenue (metric)
SUM(hits.transaction.transactionRevenue)/1000000 AS Revenue
FROM
`ABC-ca-web.123.ga_sessions_*`, Unnest(hits) hits
WHERE trafficSource.campaign LIKE '%ABC%' and date between '20200801' AND '20200831'
This also happens if you count users by date in GA, which is the usual query operation.
You cannot sum users from different periods. For example, if user X has visited the site every day in a week, analyzing the entire period the number of users is 1, but if you analyze it day by day it is 1 on the first day, 1 on the second day, 1 the third day, etc ... because the same user was there every day. If you count the users by days the result is that you have 7 users, but in reality you have 1 user because it is the same user.
I suppose that when you add "date" you are also adding a GROUP BY on "date", without which the query would error out.
When you take the distinct count of users per day, a user can be included in the groups of several days. When you drop the "date" field, that user is counted only once.
That's why the numbers look doubled, but the sum over days can be any number equal to or greater than the count you get without "date".
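The same effect can be reproduced with a few lines of Python. The visit list below is invented sample data, and `users_per_day` and `users_overall` are hypothetical helper names standing in for COUNT(DISTINCT fullVisitorId) with and without GROUP BY date.

```python
from collections import defaultdict

# Hypothetical (date, fullVisitorId) pairs: user X returns on three days.
visits = [
    ("20200801", "X"),
    ("20200802", "X"),
    ("20200803", "X"),
    ("20200801", "Y"),
]


def users_per_day(rows):
    """Distinct users grouped by date."""
    per_day = defaultdict(set)
    for day, user in rows:
        per_day[day].add(user)
    return {day: len(users) for day, users in sorted(per_day.items())}


def users_overall(rows):
    """Distinct users over the whole period, ignoring date."""
    return len({user for _, user in rows})


daily = users_per_day(visits)
total = users_overall(visits)
print(daily, sum(daily.values()), total)
```

Summing the daily counts gives 4, while the period as a whole has only 2 distinct users, so daily user counts should not be added up to get a period total.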
Trying to optimize a SQL query to find GA sessions containing pages where a "conversion" event occurs within 15 days. If page, e.g., "/example/product", contains a GA event "conversion" on June 15 then I want to count all sessions from June 1-15 that hit that page, regardless of whether or not those sessions contained a conversion event.
Is there a better way than joining to get this data? Perhaps with windowing?
I have a working query but it runs increasingly slowly when querying tables over longer time ranges and eventually fails. I've first selected only sessions with pages where a conversion happened over the entire time range of the query, then joined to those same pages with the date of conversion, then finally selected only the sessions where the date_diff between the conversion date and session date is between 0 and 15.
SELECT
date,
COUNT (DISTINCT sessionId) AS sessions
FROM (
SELECT
date,
CONCAT(CAST(visitId AS STRING),fullVisitorId) AS sessionId,
hits.page.pagePath AS pagepath
FROM
`[dataset].ga_sessions_201906*` t,
t.hits hits
WHERE
hits.page.pagePath IN (
SELECT
DISTINCT(hits.page.pagePath) AS pagepath
FROM
`[dataset].ga_sessions_201906*` t,
t.hits hits
WHERE
REGEXP_CONTAINS( hits.eventInfo.eventAction, r'(?i)conversion'))
GROUP BY
date,
visitId,
fullVisitorId,
pagepath)
INNER JOIN (
SELECT
DISTINCT(hits.page.pagePath) AS pagepath,
date AS conversionDate
FROM
`[dataset].ga_sessions_201906*` t,
t.hits hits
WHERE
REGEXP_CONTAINS( hits.eventInfo.eventAction, r'(?i)conversion')
ORDER BY
pagepath,
conversionDate)
USING
(pagepath)
WHERE
DATE_DIFF(PARSE_DATE('%Y%m%d',
conversionDate),PARSE_DATE('%Y%m%d',
date), DAY) BETWEEN 0 AND 15
GROUP BY
date
ORDER BY
date
It does produce expected results over shorter time periods but in testing over longer periods the query failed with the following message: "The service is currently unavailable."
I am trying to replicate sessions by a custom dimension in BigQuery to the Google Analytics AII. I am only a few sessions off and I can't figure out how to get an exact match.
My current understanding is that GA breaks sessions at midnight (because its data model relies on processing in day chunks). I tried to take this into account with the code below, but something is not quite right. Does anyone know how to get an exact match?
SELECT
CD12,
SUM(sessions) AS sessions
FROM (
SELECT
CD12,
CASE WHEN hitNumber = first_hit THEN visits ELSE 0 END AS sessions
FROM (
SELECT
fullVisitorId,
visitStartTime,
totals.visits,
hits.hitNumber,
CASE WHEN cd.index = 12 THEN cd.value END AS CD12,
MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
FROM `data-....`,
UNNEST(hits) AS hits,
UNNEST(hits.customDimensions) AS cd
)
)
WHERE CD12 ='0'
GROUP BY
CD12
ORDER BY
sessions DESC
Background
In BigQuery, I'm trying to find the number of visitors that both visit one of two pages and purchase a specific product.
When I run each of the sub-queries, the numbers match exactly what I see in Google Analytics.
However, when I join them, the number is different than what I see in GA. I've had someone bring the results of the two sub-queries into Excel and do the equivalent, and their results equal what I'm seeing in BQ.
Details
Here's the query:
SELECT
ProductSessions.date AS date,
SUM(ProductTransactions.totalTransactions) transactions,
COUNT(ProductSessions.visitId) visited_product_sessions
FROM (
SELECT
visitId, date
FROM
`103554833.ga_sessions_20170219`
WHERE
EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE REGEXP_CONTAINS(h.page.pagePath, r"^www.domain.com/(product|product2).html.*"))
GROUP BY visitID, date)
AS ProductSessions
LEFT JOIN (
SELECT
totals.transactions as totalTransactions,
visitId,
date
FROM
`103554833.ga_sessions_20170219`
WHERE
totals.transactions IS NOT NULL
AND EXISTS(
SELECT 1
FROM
UNNEST(hits) h,
UNNEST(h.product) prod
WHERE REGEXP_CONTAINS(prod.v2ProductName, r"^Product®$"))
GROUP BY
visitId, totals.transactions,
date) AS ProductTransactions
ON
ProductTransactions.visitId = ProductSessions.visitId
WHERE ProductTransactions.visitId is not null
GROUP BY
date
ORDER BY
date ASC
I'm expecting ProductTransactions.totalTransactions to replicate the number of transactions in Google Analytics when filtered with an advanced segment of both:
Sessions include Page matching RegEx: www.domain.com/(product|product2).html.*
Sessions include Product matches exactly: Product®
However, results in BQ are about 20% higher than in GA.
Why the difference?
I'm just learning BigQuery so this might be a dumb question, but we want to get some statistics there and one of those is the total sessions in a given day.
To do so, I've queried in BQ:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
group each by fullvisitorid
)
(I'm using the table_query because later on we might increase the range of days)
This results in 1,075,137.
But in our Google Analytics Reports, in the "Audience Overview" section, the same day results:
This report is based on 1,026,641 sessions (100% of sessions).
There's always this difference of roughly 5%, regardless of the day. So I'm wondering: even though the query is quite simple, is there any mistake we've made?
Is this difference expected to happen? I read through BigQuery's documentation but couldn't find anything on this issue.
Thanks in advance,
standardsql
Simply use SUM(totals.visits), or, when using COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))), make sure to keep only rows where totals.visits = 1!
If you use visitId and you are not grouping per day, you will merge sessions that were split at midnight!
Here are all scenarios:
SELECT
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) allSessionsUniquePerDay,
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING) )) allSessionsUniquePerSelectedTimeframe,
sum(totals.visits) interactiveSessionsUniquePerDay, -- equals GA UI sessions
COUNT(DISTINCT IF(totals.visits=1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL) ) interactiveSessionsUniquePerSelectedTimeframe,
SUM(IF(totals.visits=1,0,1)) nonInteractiveSessions
FROM
`project.dataset.ga_sessions_2017102*`
Wrap up:
fullVisitorId + visitId: useful to reconnect midnight-splits
fullVisitorId + visitStartTime: useful to take splits into account
totals.visits=1 for interaction sessions
fullVisitorId + visitStartTime where totals.visits=1: GA UI sessions (in case you need a session id)
SUM(totals.visits): simple GA UI sessions
fullVisitorId + visitId where totals.visits=1 and GROUP BY date: GA UI sessions with too many chances for errors and misunderstandings
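As a rough illustration of the wrap-up, the Python sketch below recomputes the five metrics of the query on a handful of invented rows. It assumes, as this answer describes, that a session split at midnight keeps its visitId but gets a new visitStartTime, and that totals.visits is NULL (here `None`) for non-interactive sessions. The `summarize` helper and the sample values are hypothetical.

```python
# Hypothetical session rows; "visits" stands in for totals.visits.
rows = [
    # Visitor A: one session split at midnight (same visitId, new start time).
    {"fullVisitorId": "A", "visitId": 100, "visitStartTime": 100, "visits": 1},
    {"fullVisitorId": "A", "visitId": 100, "visitStartTime": 160, "visits": 1},
    # Visitor B: a session without interaction.
    {"fullVisitorId": "B", "visitId": 200, "visitStartTime": 200, "visits": None},
    # Visitor C: an ordinary interactive session.
    {"fullVisitorId": "C", "visitId": 300, "visitStartTime": 300, "visits": 1},
]


def summarize(rows):
    """Recompute the five metrics of the query above on in-memory rows."""
    interactive = [r for r in rows if r["visits"] == 1]
    return {
        "allSessionsUniquePerDay":
            len({(r["fullVisitorId"], r["visitStartTime"]) for r in rows}),
        "allSessionsUniquePerSelectedTimeframe":
            len({(r["fullVisitorId"], r["visitId"]) for r in rows}),
        # SUM ignores NULLs, so None contributes nothing.
        "interactiveSessionsUniquePerDay":
            sum(r["visits"] or 0 for r in rows),
        "interactiveSessionsUniquePerSelectedTimeframe":
            len({(r["fullVisitorId"], r["visitId"]) for r in interactive}),
        "nonInteractiveSessions":
            sum(1 for r in rows if r["visits"] != 1),
    }


result = summarize(rows)
print(result)
```

On this sample the visitId-based keys count the split session once, while the visitStartTime-based key and the sum of totals.visits count it twice, which is the behaviour attributed to the GA UI above.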
After posting the question we got in contact with Google support and found that in Google Analytics only sessions in which an "event" was fired are actually counted.
In BigQuery you will find all sessions, regardless of whether they had an interaction or not.
To get the same result as in GA, you should filter for sessions with totals.visits = 1 in your BQ query (totals.visits is 1 only for sessions in which an event was fired).
That is:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
where totals.visits = 1
group each by fullvisitorid
)
The problem could be due to "COUNT DISTINCT".
According to this post:
COUNT DISTINCT is a statistical approximation for all results greater than 1000
You could try setting an additional COUNT parameter to improve accuracy at the expense of performance (see post), but I would first try:
SELECT COUNT( CONCAT( fullvisitorid,'_', STRING(visitid))) as sessions
from (table_query([40663402], 'timestamp(right(table_id,8)) between
timestamp("20150519") and timestamp("20150519")'))
What worked for me was this:
SELECT count(distinct sessionId) FROM(
SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*`
WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1)
The explanation (very well written in this article: https://adswerve.com/blog/google-analytics-bigquery-tips-users-sessions-part-one/) is that we should be careful when counting and dealing with sessions because, by default, Google Analytics breaks sessions that carry over past midnight (in the time zone of the view). Therefore the same session can end up in two daily tables:
[Image from the article mentioned above]
The code provided creates a sessionID by combining:
client id + visit number + date
while acknowledging the session break; the result will be in a human-readable format. Finally, to match sessions in the Google Analytics UI, make sure to keep only those with totals.visits = 1.
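For illustration only, here is how that key behaves on a few invented rows in Python. The `session_id` helper and the sample values are hypothetical, and the sketch assumes that both halves of a session split at midnight keep the same visitNumber and that totals.visits is NULL (`None`) for sessions without interaction.

```python
def session_id(row):
    """Build the human-readable key: client id + visit number + date."""
    return f'{row["clientId"]}-{row["visitNumber"]}-{row["date"]}'


# Hypothetical rows; "visits" stands in for totals.visits.
rows = [
    # One session that started before midnight and continued after it.
    {"clientId": "123.456", "visitNumber": 3, "date": "20191014", "visits": 1},
    {"clientId": "123.456", "visitNumber": 3, "date": "20191015", "visits": 1},
    # A session without interaction, excluded by the totals.visits filter.
    {"clientId": "789.012", "visitNumber": 1, "date": "20191015", "visits": None},
]

ids = {session_id(r) for r in rows if r["visits"] == 1}
print(sorted(ids))
```

Because the date is part of the key, the two halves of the split session are counted separately, and the non-interactive session is dropped by the filter.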