I am trying to mimic this chart in GA:
But I have noticed that when I don't add date to my code the numbers match, but when I add date the numbers seem to double up.
Code:
SELECT
date,
COUNT(DISTINCT fullVisitorId) AS Users,
-- New Users (metric)
COUNT(DISTINCT(
CASE
WHEN totals.newVisits = 1 THEN fullVisitorId
ELSE
NULL
END)) AS New_Users,
-- Sessions (metric)
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS Sessions,
-- Bounces (metric)
COUNT(DISTINCT
CASE
WHEN totals.bounces = 1 THEN CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))
ELSE
NULL
END
) AS Bounces,
-- Transactions (metric)
COUNT(DISTINCT hits.transaction.transactionId) AS Transactions,
--Revenue (metric)
SUM(hits.transaction.transactionRevenue)/1000000 AS Revenue
FROM
`ABC-ca-web.123.ga_sessions_*`, Unnest(hits) hits
WHERE trafficSource.campaign LIKE '%ABC%' and date between '20200801' AND '20200831'
This also happens if you count users by date in GA, which is the usual query operation.
You cannot sum users from different periods. For example, if user X visited the site every day for a week, the number of users over the entire period is 1, but day by day it is 1 on the first day, 1 on the second day, 1 on the third day, and so on, because the same user was there every day. If you add up the daily counts you get 7 users, but in reality you have 1 user because it is the same user each day.
I suppose that when you add "date" you also add a GROUP BY on "date", since the query would error out without it.
When you take the distinct count of users per day, a user can be included in multiple daily groups. But when you drop the "date" field, that user is counted only once.
That's why you see double, but the summed daily total could be any number equal to or greater than the count you get without "date".
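The effect described in these answers can be reproduced with a few inline rows. This is only a minimal sketch in BigQuery Standard SQL: the `visits` CTE and the visitor `userX` are made up for illustration and stand in for the real `ga_sessions_*` export.

```sql
-- Hypothetical data: the same visitor comes back on 7 consecutive days.
WITH visits AS (
  SELECT
    FORMAT_DATE('%Y%m%d', day) AS date,
    'userX' AS fullVisitorId
  FROM UNNEST(GENERATE_DATE_ARRAY('2020-08-01', '2020-08-07')) AS day
),
per_day AS (
  -- Grouping by date counts the visitor once in every daily bucket.
  SELECT date, COUNT(DISTINCT fullVisitorId) AS users
  FROM visits
  GROUP BY date
)
SELECT
  (SELECT SUM(users) FROM per_day) AS sum_of_daily_users,                   -- 7
  (SELECT COUNT(DISTINCT fullVisitorId) FROM visits) AS users_whole_period  -- 1
```

The daily rows are each correct on their own; it is only adding them up that overstates the number of users for the whole period.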
Trying to optimize a SQL query to find GA sessions containing pages where a "conversion" event occurs within 15 days. If page, e.g., "/example/product", contains a GA event "conversion" on June 15 then I want to count all sessions from June 1-15 that hit that page, regardless of whether or not those sessions contained a conversion event.
Is there a better way than joining to get this data? Perhaps with windowing?
I have a working query but it runs increasingly slowly when querying tables over longer time ranges and eventually fails. I've first selected only sessions with pages where a conversion happened over the entire time range of the query, then joined to those same pages with the date of conversion, then finally selected only the sessions where the date_diff between the conversion date and session date is between 0 and 15.
SELECT
date,
COUNT (DISTINCT sessionId) AS sessions
FROM (
SELECT
date,
CONCAT(CAST(visitId AS STRING),fullVisitorId) AS sessionId,
hits.page.pagePath AS pagepath
FROM
`[dataset].ga_sessions_201906*` t,
t.hits hits
WHERE
hits.page.pagePath IN (
SELECT
DISTINCT(hits.page.pagePath) AS pagepath
FROM
`[dataset].ga_sessions_201906*` t,
t.hits hits
WHERE
REGEXP_CONTAINS( hits.eventInfo.eventAction, r'(?i)conversion'))
GROUP BY
date,
visitId,
fullVisitorId,
pagepath)
INNER JOIN (
SELECT
DISTINCT(hits.page.pagePath) AS pagepath,
date AS conversionDate
FROM
`[dataset].ga_sessions_201906*` t,
t.hits hits
WHERE
REGEXP_CONTAINS( hits.eventInfo.eventAction, r'(?i)conversion')
ORDER BY
pagepath,
conversionDate)
USING
(pagepath)
WHERE
DATE_DIFF(PARSE_DATE('%Y%m%d',
conversionDate),PARSE_DATE('%Y%m%d',
date), DAY) BETWEEN 0 AND 15
GROUP BY
date
ORDER BY
date
It does produce expected results over shorter time periods but in testing over longer periods the query failed with the following message: "The service is currently unavailable."
Thank you for stopping by! I would be grateful to (re)create the ultimate GA Session Funnel in Big Query. The focus is on the funnel per session, with certain, but not necessarily sequentially visited pages during one session.
The solution should count sessions as COUNT( DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))).
Further, the funnel should be of the form that every funnel step can only be reached if the previous step has been completed within a session (e.g. the fourth step should only be counted if steps 1 - 3 have been visited during the session). However, the steps do not need to be performed consecutively.
That is, unfortunately, why this example, which I like a lot, would not work for me. It returns numbers for visits of totals.visits. Also, I need to use REGEXP_CONTAINS for the pages, as I do not have events (or custom dimensions) on my pages for the funnel steps. For the original query (for every respective step):
SUM((SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'landing_page' LIMIT 1)) Landing_Page
I tried:
COUNT( DISTINCT( SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"myfunnelpage")))
However, my funnel step visits are actually more than my total “sessions” as per COUNT( DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS overday_sessions.
Another example looks at user sessions (I am incredibly impressed, also absolutely intimidated, props to @Martin).
Allegedly, there is a website that ought to have it all, but it was down when I wrote this. #StuffGettingLostOnline
My approach would look something like this. But it returns only sessions with single page views, not sequential ones:
SELECT
date,
COUNT( DISTINCT( SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewpage") LIMIT 1)) AS product_overview_s1,
COUNT( DISTINCT( SELECT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) FROM UNNEST(GA.hits) WHERE EXISTS(SELECT 1 FROM UNNEST(GA.hits) WHERE REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewregex")) AND REGEXP_CONTAINS(hits.page.pagePath, r"cartoverviewregex") LIMIT 1)) AS cart_overview_s2
FROM
data as GA,
UNNEST(GA.hits) AS hits
WHERE hits.type = "PAGE"
AND
TRUE IN UNNEST(
[REGEXP_CONTAINS(hits.page.pagePath, r"productoverviewpage"),
REGEXP_CONTAINS(hits.page.pagePath, r"cartoverviewregex")]
)
Any ideas? Anyone able to recreate the ultimate big query funnel using the “correct” session count?
You can use inline subqueries to check for the individual steps of the funnel:
WITH
sessions AS (
SELECT
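(
-- earliest hit on the first funnel page; LIMIT 1 keeps the subquery scalar
SELECT
hits
FROM
UNNEST(hits) hits
WHERE
hits.page.pagePath = "/"
ORDER BY
hits.hitNumber
LIMIT 1
) first_step,
(
-- latest hit on the second funnel page, so any later visit to it counts
SELECT
hits
FROM
UNNEST(hits) hits
WHERE
hits.page.pagePath = "/basket"
ORDER BY
hits.hitNumber DESC
LIMIT 1
) second_step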
FROM
`project.dataset.ga_sessions_*`)
SELECT
COUNT(first_step) sessions_step_one,
COUNTIF(first_step.hitNumber < second_step.hitNumber) sessions_step_two
FROM
sessions
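To see how this pattern behaves with regex-matched pages and the session key from the question, here is a small self-contained sketch. Everything in the `data` CTE is hypothetical (three invented sessions, and a flattened `pagePath` field in place of the export's `hits.page.pagePath`), so treat it as an illustration of the idea rather than a drop-in query.

```sql
WITH data AS (
  -- Session A: overview first, cart later (reaches both steps).
  SELECT 'A' AS fullVisitorId, 1000 AS visitStartTime,
    [STRUCT(1 AS hitNumber, '/productoverview' AS pagePath),
     STRUCT(2 AS hitNumber, '/other' AS pagePath),
     STRUCT(3 AS hitNumber, '/cart' AS pagePath)] AS hits
  UNION ALL
  -- Session B: cart before overview (step 1 only).
  SELECT 'B', 2000,
    [STRUCT(1 AS hitNumber, '/cart' AS pagePath),
     STRUCT(2 AS hitNumber, '/productoverview' AS pagePath)]
  UNION ALL
  -- Session C: never sees the overview page (no funnel step).
  SELECT 'C', 3000,
    [STRUCT(1 AS hitNumber, '/cart' AS pagePath)]
),
steps AS (
  -- One row per session: hits is never cross joined, so sessions cannot multiply.
  SELECT
    CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS sessionId,
    (SELECT MIN(h.hitNumber) FROM UNNEST(hits) AS h
     WHERE REGEXP_CONTAINS(h.pagePath, r'productoverview')) AS step1_first_hit,
    (SELECT MAX(h.hitNumber) FROM UNNEST(hits) AS h
     WHERE REGEXP_CONTAINS(h.pagePath, r'cart')) AS step2_last_hit
  FROM data
)
SELECT
  COUNT(DISTINCT sessionId) AS sessions,                                                   -- 3
  COUNT(DISTINCT IF(step1_first_hit IS NOT NULL, sessionId, NULL)) AS step1_sessions,      -- 2
  COUNT(DISTINCT IF(step1_first_hit < step2_last_hit, sessionId, NULL)) AS step2_sessions  -- 1
FROM steps
```

Comparing the earliest step-one hit with the latest step-two hit is one way to express "step two happened at some point after step one" without requiring the pages to be consecutive. A longer funnel would need one such comparison per step.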
I'm a bit surprised: how come successive sessions from the same user have visitNumber == 1 (it happens with more than one user)? Doesn't visitNumber (the session number for the user) increment with each successive session?
Please see the attached screenshot.
====
SELECT fullvisitorid, visitid, date, visitNumber, hitNumber, type, page.pagePath, isInteraction
FROM `122623284.ga_sessions_2017*` ga_sessions, unnest(hits) as ht
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
ORDER BY fullvisitorid, visitid, hitnumber
Thanks in advance if anyone has any idea under what scenarios this can happen.
cheers!
UPDATE (after @WillianFuks's response)
It's still the same after re-running the query that @WillianFuks suggested.
The observation here is the stark date difference between the successive visits:
188 days (red)
210 days (green)
184 days (blue)
Analytics looks back for the last session in order to increment the visitNumber count, but there is a limit on the number of days it looks back, called the lookback window. I don't remember exactly what it is for Analytics, but the lookback window generally ranges from 90 days to 180 days for various Google products.
Since it is not able to find the previous visit within the lookback window, it resets the visitNumber to 1 again.
Update: By default it is 6 months for Google Analytics.
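One way to check whether a visitor's sessions are far enough apart for this to apply is to compute the gap to the previous session. The sketch below uses invented visitors and dates (the real export stores `date` as a 'YYYYMMDD' string, which would need `PARSE_DATE('%Y%m%d', date)` first), so it only illustrates the calculation.

```sql
-- Hypothetical sessions: v1 returns after a long pause, v2 after a short one.
WITH sessions AS (
  SELECT 'v1' AS fullVisitorId, DATE '2017-01-05' AS visitDate UNION ALL
  SELECT 'v1', DATE '2017-07-12' UNION ALL
  SELECT 'v2', DATE '2017-03-01' UNION ALL
  SELECT 'v2', DATE '2017-03-20'
)
SELECT
  fullVisitorId,
  visitDate,
  -- NULL for a visitor's first session; 188 for v1's second, 19 for v2's second.
  DATE_DIFF(
    visitDate,
    LAG(visitDate) OVER (PARTITION BY fullVisitorId ORDER BY visitDate),
    DAY) AS days_since_previous
FROM sessions
ORDER BY fullVisitorId, visitDate
```

Sessions whose `days_since_previous` exceeds the lookback window are the ones where a visitNumber of 1 on a returning visitor would be expected under this explanation.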
As Elliott suggested in his comment, the problem is most likely due to the duplication that happens when you apply UNNEST to the hits field.
You can confirm that by running this query:
SELECT
fullvisitorid fv,
visitid,
date,
visitNumber,
ARRAY(SELECT AS STRUCT hitNumber, type, page.pagePath AS pagePath, isInteraction FROM UNNEST(hits)) data
FROM `122623284.ga_sessions_2017*`
WHERE _TABLE_SUFFIX between '0101' and '0731'
AND fullvisitorid in ('3635735417215222540', '4036640811518552822', '800892955541145796')
LIMIT 1000
This will bring the fields inside hits without making the cross product (unnest operation) with the outer fields.
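The duplication is easy to see on a single made-up session. In this sketch the `ga` CTE is hypothetical (one session with three hits, and a flat `visits` column standing in for `totals.visits`).

```sql
WITH ga AS (
  SELECT 'A' AS fullVisitorId, 1 AS visitNumber, 1 AS visits,
    [STRUCT(1 AS hitNumber, 'PAGE' AS type),
     STRUCT(2 AS hitNumber, 'EVENT' AS type),
     STRUCT(3 AS hitNumber, 'PAGE' AS type)] AS hits
)
SELECT
  -- One row per session: the session-level value is counted once.
  (SELECT SUM(visits) FROM ga) AS visits_session_level,                   -- 1
  -- Cross join with hits: the session row is repeated once per hit.
  (SELECT SUM(visits) FROM ga, UNNEST(hits) AS h) AS visits_after_unnest, -- 3
  (SELECT COUNT(*) FROM ga, UNNEST(hits) AS h) AS rows_after_unnest       -- 3
```

Any session-level field (visitNumber, totals.*, and so on) is repeated the same way once hits is unnested in the FROM clause, which is why the ARRAY(SELECT AS STRUCT ...) form keeps one row per session.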
Currently experiencing an issue with a BigQuery query I'm using, specifically with roll-up properties. The following query shows exactly double the number of visits between the two visit calculations (visits and visits2) on certain dates. On other dates the numbers match, and on others they approximately double. Predominantly, though, visits is double visits2. Any ideas why?
SELECT
date,
geoNetwork.country AS Country,
SUM(totals.visits) AS visits,
COUNT(DISTINCT CONCAT(CAST (visitId AS string),fullvisitorId)) AS visits2,
COUNT(DISTINCT(fullVisitorId)) AS Users,
SUM(totals.newVisits) AS new_,
SUM(totals.pageviews) AS PAGEVIEWS,
SUM(totals.bounces) AS BOUNCES,
SUM(CASE
WHEN device.isMobile = TRUE THEN (totals.visits)
ELSE 0 END) mobilevisits,
SUM(CASE
WHEN trafficSource.medium = 'organic' THEN (totals.visits)
ELSE 0 END) organicvisits,
SUM(CASE
WHEN EXISTS( SELECT 1 FROM UNNEST(hits) hits WHERE REGEXP_CONTAINS(hits.eventInfo.eventAction,'register$|registersuccess|new registration|account signup|registro')) THEN 1
ELSE 0 END) AS NewRegistrations,
SUM(CASE
WHEN EXISTS( SELECT 1 FROM UNNEST(hits) hits WHERE REGEXP_CONTAINS(hits.eventInfo.eventAction, 'add to cart|add to bag|click to buy|ass to basket|comprar|addtobasket::')) THEN 1
ELSE 0 END) AS ClickToBuy,
SUM(totals.transactions) AS Transactions,
SUM(totals.transactionRevenue) /1000000 AS Revenue
FROM
`project-1&&&&.dataset.ga_sessions*`
WHERE
1 = 1
AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2016-01-01')
AND TIMESTAMP('2017-05-08')
GROUP BY
date,
Country
ORDER BY
visits DESC;
It looks like the issue is also happening with bounces, pageviews, mobilevisits and organicvisits.
If I have to go with the more manual version of visits2, I'm also going to need to do the same for the other metrics. Can anyone point me in the direction of the more accurate calculations for these, e.g. how to calculate bounces without using totals.bounces?
Thanks
Elliot's comment was correct: it looks like I had to add an underscore (I went further and added "_2017" to ensure that the query was not querying the intraday tables). Interestingly, this only seems to be an issue for roll-up properties linked to BigQuery; it is not needed for normal properties, where the wildcard format used in the original works just fine.
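A wildcard table matches every table whose name starts with the given prefix, so a shorter prefix can pull in more tables than intended. The sketch below mimics that matching on two hypothetical table names; which extra tables actually exist in a roll-up dataset is an assumption here, not something confirmed above.

```sql
-- Hypothetical table names in a GA export dataset.
WITH tables AS (
  SELECT name FROM UNNEST([
    'ga_sessions_20170507',
    'ga_sessions_intraday_20170508'
  ]) AS name
)
SELECT
  name,
  -- Would `ga_sessions*` pick this table up? (true for both)
  STARTS_WITH(name, 'ga_sessions') AS matched_by_short_prefix,
  -- Would `ga_sessions_2017*` pick it up? (true only for the daily table)
  STARTS_WITH(name, 'ga_sessions_2017') AS matched_by_year_prefix,
  -- What the original date filter would extract from the resulting suffix.
  REGEXP_EXTRACT(SUBSTR(name, LENGTH('ga_sessions') + 1), r'.*_(.*)') AS extracted_date
FROM tables
```

With the short prefix, both names yield a valid-looking date (20170507 and 20170508), so the original date filter would not exclude the intraday table on its own.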
I'm just learning BigQuery so this might be a dumb question, but we want to get some statistics there and one of those is the total sessions in a given day.
To do so, I've queried in BQ:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
group each by fullvisitorid
)
(I'm using the table_query because later on we might increase the range of days)
This results in 1,075,137.
But in our Google Analytics Reports, in the "Audience Overview" section, the same day results:
This report is based on 1,026,641 sessions (100% of sessions).
There's always a difference of roughly 5%, regardless of the day. So I'm wondering, even though the query is quite simple, is there any mistake we've made?
Is this difference expected to happen? I read through BigQuery's documentation but couldn't find anything on this issue.
Thanks in advance,
standardsql
Simply use SUM(totals.visits), or, when using COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))), make sure to filter on totals.visits = 1!
If you use visitId and you are not grouping per day, you will combine midnight-split-sessions!
Here are all scenarios:
SELECT
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) allSessionsUniquePerDay,
COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING) )) allSessionsUniquePerSelectedTimeframe,
sum(totals.visits) interactiveSessionsUniquePerDay, -- equals GA UI sessions
COUNT(DISTINCT IF(totals.visits=1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL) ) interactiveSessionsUniquePerSelectedTimeframe,
SUM(IF(totals.visits=1,0,1)) nonInteractiveSessions
FROM
`project.dataset.ga_sessions_2017102*`
Wrap up:
fullVisitorId + visitId: useful to reconnect midnight-splits
fullVisitorId + visitStartTime: useful to take splits into account
totals.visits=1 for interaction sessions
fullVisitorId + visitStartTime where totals.visits=1: GA UI sessions (in case you need a session id)
SUM(totals.visits): simple GA UI sessions
fullVisitorId + visitId where totals.visits=1 and GROUP BY date: GA UI sessions with too many chances for errors and misunderstandings
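The scenarios in this wrap-up can be tried on a handful of invented rows. In this sketch the `sessions` CTE is hypothetical: visitor A has one session split at midnight (same visitId, two visitStartTime values, two dates), visitor B has a session without interaction (visits is NULL), and `visits` is a flat stand-in for `totals.visits`.

```sql
WITH sessions AS (
  SELECT 'A' AS fullVisitorId, 100 AS visitId, 100 AS visitStartTime,
         '20171020' AS date, 1 AS visits UNION ALL
  SELECT 'A', 100, 160, '20171021', 1 UNION ALL                    -- midnight split of the same visit
  SELECT 'B', 200, 200, '20171020', CAST(NULL AS INT64) UNION ALL  -- no interaction
  SELECT 'C', 300, 300, '20171021', 1
)
SELECT
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING))) AS allSessionsUniquePerDay,         -- 4
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS allSessionsUniquePerSelectedTimeframe,  -- 3
  SUM(visits) AS interactiveSessionsUniquePerDay,                                                           -- 3
  COUNT(DISTINCT IF(visits = 1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL))
    AS interactiveSessionsUniquePerSelectedTimeframe,                                                       -- 2
  SUM(IF(visits = 1, 0, 1)) AS nonInteractiveSessions                                                       -- 1
FROM sessions
```

The visitId-based counts merge A's two rows back into one session, while the visitStartTime-based count and SUM(visits) keep them as two, which is the midnight-split behaviour the list describes.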
After posting the question we got into contact with Google support and found that in Google Analytics only sessions that had an "event" being fired are actually counted.
In Bigquery you will find all sessions regardless whether they had an interaction or not.
In order to find the same result as in GA, you should filter by sessions with totals.visits = 1 in your BQ query (totals.visits is 1 only for sessions that had an event being fired).
That is:
select sum(sessions) as total_sessions from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
where totals.visits = 1
group each by fullvisitorid
)
The problem could be due to "COUNT DISTINCT".
According to this post:
COUNT DISTINCT is a statistical approximation for all results greater than 1000
You could try setting an additional COUNT parameter to improve accuracy at the expense of performance (see post), but I would first try:
SELECT COUNT( CONCAT( fullvisitorid,'_', STRING(visitid))) as sessions
from (table_query([40663402], 'timestamp(right(table_id,8)) between
timestamp("20150519") and timestamp("20150519")'))
What worked for me was this:
SELECT count(distinct sessionId) FROM(
SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*`
WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1)
The explanation (very well written in this article: https://adswerve.com/blog/google-analytics-bigquery-tips-users-sessions-part-one/) is that when counting and dealing with sessions we should be careful because, by default, Google Analytics breaks sessions that carry over midnight (in the time zone of the view). Therefore the same session can end up in two daily tables (the article illustrates this with an image).
The code provided creates a sessionID by combining:
client id + visit number + date
while acknowledging the session break; the result will be in a human-readable format. Finally to match sessions in the Google Analytics UI, make sure to filter to only those with totals.visits = 1.