visitNumber getting reset in BigQuery Export of Google Analytics - google-analytics

We've used the following query to inspect visitNumber over time and found that a single fullVisitorId can have more than one 'first' visit.
select
count(distinct fullVisitorId) as users,
newVisits
From(
select fullVisitorId, visitNumber, count(distinct visitId) as newVisits
from table_date_range([91311726.ga_sessions_], timestamp('20151101'), timestamp('20161124') )
where visitNumber = 1
group by fullVisitorId, visitNumber )
group by newVisits;
Result:
| users | newVisits |
|-----------|------------|
| 18 | 3 |
| 26041561 | 1 |
| 237792 | 2 |
My understanding is that for Universal Analytics the visitNumber is a counter on the Google Analytics backend that increments for each new session per fullVisitorId, so how is it possible to have more than one session with visitNumber = 1?

There are two main causes for this.
1. Visits spanning a day boundary. Say a visit starts on 20151101 at 11:45 pm and lasts until 20151102 at 1:00 am. This can create two different sessions, but the visitNumber won't be incremented.
2. The lookback limit. If a user's last session was more than 183 days ago, the user is treated as a new user and their visitNumber resets to 1. Analytics has to look back to find the previous session in order to increment visitNumber, but the maximum lookback is 183 days. So if a user visited on 20151101 and only came back on 20160701, both visits would have visitNumber = 1.
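The lookback cause can be illustrated with a toy model. The Python sketch below is not how Google Analytics is implemented; it only assumes, as described in the answer, that the counter restarts when the previous session is more than 183 days old. The function name `assign_visit_numbers` and the day-level granularity are illustrative assumptions.

```python
from datetime import date

LOOKBACK_DAYS = 183  # lookback window quoted in the answer (assumed to be in whole days)


def assign_visit_numbers(session_dates, lookback_days=LOOKBACK_DAYS):
    """Toy model: number one visitor's sessions in date order.

    The counter restarts at 1 whenever the gap since the previous
    session is larger than the lookback window.
    """
    numbered = []
    previous = None
    counter = 0
    for current in sorted(session_dates):
        if previous is None or (current - previous).days > lookback_days:
            counter = 1  # no earlier session found inside the window
        else:
            counter += 1
        numbered.append((current, counter))
        previous = current
    return numbered


# The example from the answer: the two visits are 243 days apart,
# so under this model both are numbered 1.
visits = assign_visit_numbers([date(2015, 11, 1), date(2016, 7, 1)])
```

Under this model, two sessions a few days apart would be numbered 1 and 2, while the pair above both come out as 1.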

Related

How to create a conversion funnel based on pages and events for GA data in BigQuery

I have the following data in BigQuery:
date fullVisitorId sessionId hitNumber type url eventCategory eventAction eventLabel
20210101 973454546035798949 973454546035798949162783837520210101 1 PAGE homepage.com Null Null Null
20210101 973454546035798949 973454546035798949162783837520210101 2 EVENT homepage.com/purchase View Book Harry_Potter
20210101 973454546035798949 973454546035798949162783837520210101 3 EVENT homepage.com/purchase Purchase Book Harry_Potter
...
I want to create a conversion funnel based on URLs and events, not necessarily sequential. For example, I want to calculate the number of distinct users (fullVisitorId) and the number of distinct sessions (sessionId) in which:
Users visited the homepage (homepage.com).
Then the event with Category View, Action Book and Label Harry_Potter was triggered,
Then the event with Category Purchase, Action Book and Label Harry_Potter was triggered.
Again the hits are not necessarily sequential, which means that the hit numbers could be 1, 4, and 8, respectively, for these 3 steps. Also, the real number of desired steps is more than 10.
Ideally, the final results should look like this:
Type Date Step 1 Step 2 Step 3 Step 4
Users 01/01/2021 120 110 90 ...
Users 02/01/2021 130 80 70 ...
Sessions 01/01/2021 200 120 100 ...
Sessions 02/01/2021 220 80 70 ...
where Step 1, Step 2, and Step 3 represent the number of users and sessions in which the particular step was done.
Any ideas? Thanks!
You will have to do something like the SQL below. Write one CTE per step condition and then join them.
WITH STEP1 AS
(
SELECT fullVisitorId, Date, SUM(hitNumber) AS STEP1
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE STARTS_WITH(url, "homepage.com") AND fullVisitorId IS NOT NULL
GROUP BY fullVisitorId, Date
),
STEP2 AS
(
SELECT fullVisitorId, Date, SUM(hitNumber) AS STEP2
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE eventCategory = 'View' AND eventAction = 'Book' AND eventLabel = 'Harry_Potter' AND fullVisitorId IS NOT NULL
GROUP BY fullVisitorId, Date
),
STEP3 AS
(
SELECT fullVisitorId, Date, SUM(hitNumber) AS STEP3
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE eventCategory = 'Purchase' AND eventAction = 'Book' AND eventLabel = 'Harry_Potter' AND fullVisitorId IS NOT NULL
GROUP BY fullVisitorId, Date
),
JOINED_DATA AS
(
SELECT 'Users' AS Type,
coalesce(SUB_QUERY.Date,STEP3.Date) AS Date,
SUB_QUERY.STEP1,
SUB_QUERY.STEP2,
STEP3.STEP3
FROM STEP3 FULL OUTER JOIN
(
-- carry STEP1 and STEP2 through so the outer query can reference them
SELECT coalesce(STEP1.fullVisitorId,STEP2.fullVisitorId) AS fullVisitorId,
coalesce(STEP1.Date,STEP2.Date) AS Date,
STEP1.STEP1 AS STEP1,
STEP2.STEP2 AS STEP2
FROM STEP1 FULL OUTER JOIN STEP2
ON STEP1.Date = STEP2.Date AND STEP1.fullVisitorId = STEP2.fullVisitorId
) AS SUB_QUERY
ON STEP3.fullVisitorId = SUB_QUERY.fullVisitorId AND STEP3.Date = SUB_QUERY.Date
)
SELECT * FROM JOINED_DATA
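As a cross-check of the funnel logic the question describes (ordered steps, hits not necessarily adjacent), here is a small Python sketch that works on rows shaped like the sample data. The `STEPS` predicates, the function names, and the use of `None` for the Null columns are assumptions for illustration. It is not a translation of the SQL above, and it evaluates the funnel within each session.

```python
from collections import defaultdict

# One predicate per funnel step, in order. Field names mirror the question's columns.
STEPS = [
    lambda h: h["type"] == "PAGE" and h["url"] == "homepage.com",
    lambda h: (h["eventCategory"], h["eventAction"], h["eventLabel"]) == ("View", "Book", "Harry_Potter"),
    lambda h: (h["eventCategory"], h["eventAction"], h["eventLabel"]) == ("Purchase", "Book", "Harry_Potter"),
]


def steps_reached(session_hits, steps=STEPS):
    """Walk a session's hits by hitNumber and count how many steps were completed in order."""
    reached = 0
    for hit in sorted(session_hits, key=lambda h: h["hitNumber"]):
        if reached < len(steps) and steps[reached](hit):
            reached += 1
    return reached


def funnel(hits, steps=STEPS):
    """Return {date: {"Users": [...], "Sessions": [...]}} with one count per step."""
    sessions = defaultdict(list)
    for hit in hits:
        sessions[(hit["date"], hit["fullVisitorId"], hit["sessionId"])].append(hit)

    users_at = defaultdict(lambda: [set() for _ in steps])
    sessions_at = defaultdict(lambda: [set() for _ in steps])
    for (day, visitor, session), session_hits in sessions.items():
        # A session that reached step k also counts for every earlier step.
        for i in range(steps_reached(session_hits, steps)):
            users_at[day][i].add(visitor)
            sessions_at[day][i].add(session)

    return {
        day: {
            "Users": [len(s) for s in users_at[day]],
            "Sessions": [len(s) for s in sessions_at[day]],
        }
        for day in sorted({key[0] for key in sessions})
    }
```

Adding more steps only requires appending predicates to `STEPS`, which may be easier to maintain than one CTE per step when the funnel has ten or more stages.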

Calculate Count of users every month in Kusto query language

I have a table named tab1:
Timestamp Username. sessionid
12-12-2020. Ravi. abc123
12-12-2020. Hari. oipio878
12-12-2020. Ravi. ytut987
11-12-2020. Ram. def123
10-12-2020. Ravi. jhgj54
10-12-2020. Shiv. qwee090
10-12-2020. bob. rtet4535
30-12-2020. sita. jgjye56
I want to count the number of distinct Usernames per day, so that the output would be:
day. count
10-12-2020. 3
11-12-2020. 1
12-12-2020. 2
30-12-2020. 1
Tried query:
tab1
| where timestamp > datetime(01-08-2020)
| range timestamp from datetime(01-08-2020) to now() step 1d
| extend day = dayofmonth(timestamp)
| distinct Username
| count
| project day, count
To get a close estimate of the number of distinct Usernames per day, just run this (dcount() returns an approximation, so the number may not be exact):
tab1
| summarize dcount(Username) by bin(Timestamp, 1d)
If you want accurate results, then you should do this (just note that the query will be less performant than the previous one, and will only work if you have up to 1,000,000 usernames / day):
tab1
| summarize make_set(Username) by bin(Timestamp, 1d)
| project Timestamp, Count = array_length(set_Username)
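For reference, the exact per-day distinct count that the question asks for can be written out in a few lines of Python. This is only a sketch of the logic on the sample rows from the question; the function name and the ISO date strings are assumptions, and it is no substitute for running the aggregation in Kusto.

```python
from collections import defaultdict


def distinct_users_per_day(rows):
    """rows: iterable of (day, username, session_id). Returns {day: exact distinct username count}."""
    users = defaultdict(set)
    for day, username, _session_id in rows:
        users[day].add(username)  # a set keeps each username once per day
    return {day: len(names) for day, names in sorted(users.items())}


# Sample rows from the question, with dates rewritten as ISO strings so they sort correctly.
tab1 = [
    ("2020-12-12", "Ravi", "abc123"),
    ("2020-12-12", "Hari", "oipio878"),
    ("2020-12-12", "Ravi", "ytut987"),
    ("2020-12-11", "Ram", "def123"),
    ("2020-12-10", "Ravi", "jhgj54"),
    ("2020-12-10", "Shiv", "qwee090"),
    ("2020-12-10", "bob", "rtet4535"),
    ("2020-12-30", "sita", "jgjye56"),
]
counts = distinct_users_per_day(tab1)
```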

Rolling 7 day uniques & 31 day uniques in BigQuery w/ Firebase

I'm trying to set up rolling 7-day users and rolling 31-day users in BigQuery (with Firebase) using the following query. For each day, it should examine the previous 31 days as well as the previous 7 days. I've been stuck and keep getting the message:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
The query:
With events AS (
SELECT PARSE_DATE("%Y%m%d", event_date) as event_date, user_pseudo_id FROM `my_data_table.analytics_178206500.events_*`
Where _table_suffix NOT LIKE "i%" AND event_name = "user_engagement"
GROUP BY 1, 2
),
DAU AS (
SELECT event_date as date, COUNT(DISTINCT(user_pseudo_id)) AS dau
From events
GROUP BY 1
)
SELECT DAU.date, DAU.dau,
(
SELECT count(distinct(user_pseudo_id))
FROM events
WHERE events.event_date BETWEEN DATE_SUB(DAU.date, INTERVAL 29 DAY) and dau.date
) as mau,
(
SELECT count(distinct(user_pseudo_id))
FROM events
WHERE events.event_date BETWEEN DATE_SUB(DAU.date, INTERVAL 7 DAY) and dau.date
) as wau
FROM DAU
ORDER BY 1 DESC
I'm able to get the DAU part, but the last 7 day users (WAU) and last 31 day users (MAU) aren't coming through. I have tried to CROSS JOIN DAU with events, but I get the following results: (image: GraphResults)
Any pointers would be greatly appreciated.
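To make the intended calculation concrete, here is a minimal Python sketch of rolling distinct users. It assumes a window of N days means the day itself plus the N-1 preceding days, which is one possible reading of the question. The function name `rolling_uniques` and the input shape are illustrative assumptions, and this is not a BigQuery solution.

```python
from datetime import date, timedelta


def rolling_uniques(events, window_days):
    """events: iterable of (event_date, user_id).

    Returns {day: number of distinct users seen in the window_days ending on day},
    for every day that has at least one event.
    """
    users_by_day = {}
    for day, user in events:
        users_by_day.setdefault(day, set()).add(user)

    result = {}
    for day in sorted(users_by_day):
        window = set()
        for offset in range(window_days):
            # Union the per-day sets so a user active on several days is counted once.
            window |= users_by_day.get(day - timedelta(days=offset), set())
        result[day] = len(window)
    return result


events = [
    (date(2021, 1, 1), "a"),
    (date(2021, 1, 2), "b"),
    (date(2021, 1, 8), "a"),
    (date(2021, 1, 8), "c"),
]
wau = rolling_uniques(events, 7)
dau = rolling_uniques(events, 1)
```

The key point is that distinct counts cannot be obtained by summing daily counts, because the same user can appear on several days inside the window.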

Big Query landing page figures not consistent with Google Analytics interface

I'm using BigQuery to report on Google Analytics data. I'm trying to recreate landing page data using BigQuery.
The following query reports 18% fewer sessions than in the Google Analytics interface:
SELECT DISTINCT
fullVisitorId,
visitID,
h.page.pagePath AS LandingPage
FROM
`project-name.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
hitNumber = 1
AND h.type = 'PAGE'
AND _TABLE_SUFFIX BETWEEN '20170331' AND '20170331'
ORDER BY fullVisitorId DESC
Where am I going wrong with my approach? Why can't I get to within a small margin of the number in the GA interface's reported figure?
There are multiple reasons:
1. BigQuery query for the equivalent landing page report:
SELECT
LandingPage,
COUNT(sessionId) AS Sessions,
100 * SUM(totals.bounces)/COUNT(sessionId) AS BounceRate,
AVG(totals.pageviews) AS AvgPageviews,
SUM(totals.timeOnSite)/COUNT(sessionId) AS AvgTimeOnSite,
from(
SELECT
CONCAT(fullVisitorId,STRING(visitId)) AS sessionID,
totals.bounces,
totals.pageviews,
totals.timeOnSite,
hits.page.pagePath AS landingPage
FROM (
SELECT
fullVisitorId,
visitId,
hits.page.pagePath,
totals.bounces,
totals.pageviews,
totals.timeOnSite,
MIN(hits.hitNumber) WITHIN RECORD AS firstHit,
hits.hitNumber AS hitNumber
FROM (TABLE_DATE_RANGE ([XXXYYYZZZ.ga_sessions_],TIMESTAMP('2016-08-01'), TIMESTAMP ('2016-08-31')))
WHERE
hits.type = 'PAGE'
AND hits.page.pagePath != '')
WHERE
hitNumber = firstHit)
GROUP BY
LandingPage
ORDER BY
Sessions DESC,
LandingPage
2. Pre-calculated data (pre-aggregated tables):
Google uses precalculated data to speed up the UI. Google does not specify when this is done, but it can happen at any point in time. These are known as pre-aggregated tables.
So if you compare the numbers from the GA UI to your BigQuery output, you can expect to see a discrepancy. Go ahead and rely on your BigQuery data.
You can achieve the same thing by simply adding the below to your select statement:
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
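The difference between this definition and the `hitNumber = 1` filter in the question can be shown with a small Python sketch. The hit dictionaries and both function names are assumptions for illustration. The point is that a session whose first hit is an event still has a landing page (its first PAGE hit), but the `hitNumber = 1 AND type = 'PAGE'` filter drops it.

```python
def landing_page(hits):
    """Landing page = pagePath of the PAGE hit with the lowest hitNumber, or None if there is none."""
    page_hits = [h for h in hits if h["type"] == "PAGE"]
    if not page_hits:
        return None
    return min(page_hits, key=lambda h: h["hitNumber"])["pagePath"]


def landing_page_naive(hits):
    """The question's filter: only accept a PAGE hit that is also hit number 1."""
    for h in hits:
        if h["hitNumber"] == 1 and h["type"] == "PAGE":
            return h["pagePath"]
    return None


# A session that starts with an event before the first pageview.
session = [
    {"hitNumber": 1, "type": "EVENT", "pagePath": "/home"},
    {"hitNumber": 2, "type": "PAGE", "pagePath": "/home"},
    {"hitNumber": 3, "type": "PAGE", "pagePath": "/cart"},
]
first_page = landing_page(session)
naive_page = landing_page_naive(session)
```

Here `landing_page` finds `/home`, while the naive filter finds nothing, so the session would be missing from the report.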
I can get a 1 to 1 match with the GA UI on my end when I run something like below, which is a bit more concise than the original answer:
SELECT DISTINCT
a.landingpage
,COUNT(DISTINCT(a.sessionId)) sessions
,SUM(a.bounces) bounces
,AVG(a.avg_pages) avg_pages
,(SUM(tos)/COUNT(DISTINCT(a.sessionId)))/60 session_duration
FROM
(
SELECT DISTINCT
CONCAT(CAST(fullVisitorId AS STRING),CAST(visitStartTime AS STRING)) sessionId
,(SELECT page.pagePath FROM UNNEST(hits) WHERE hitnumber = (SELECT MIN(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE')) landingpage
,totals.bounces bounces
,totals.timeonsite tos
,(SELECT COUNT(hitnumber) FROM UNNEST(hits) WHERE type = 'PAGE') avg_pages
FROM `tablename_*`
WHERE _TABLE_SUFFIX >= '20180801'
AND _TABLE_SUFFIX <= '20180808'
AND totals.visits = 1
) a
GROUP BY 1
Here is another way that gets the same number:
SELECT
LandingPage,
COUNT(DISTINCT(sessionID)) AS sessions
FROM(
SELECT
CONCAT(fullVisitorId,CAST(visitId AS STRING)) AS sessionID,
FIRST_VALUE(hits.page.pagePath) OVER (PARTITION BY CONCAT(fullVisitorId,CAST(visitId AS STRING)) ORDER BY hits.hitNumber ASC ) AS LandingPage
FROM
`xxxxxxxx1.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND hits.type ='PAGE'
GROUP BY fullVisitorId, visitId, sessionID,hits.page.pagePath,hits.hitNumber
)
GROUP BY LandingPage
ORDER BY sessions DESC
There is a hits.isEntrance field in the schema that can be used for this purpose.
The example below would show you yesterday's landing pages:
#standardSQL
select
date,
hits.page.pagePath as landingPage,
sum(totals.visits) as visits,
sum(totals.bounces) as bounces,
sum(totals.transactions) as transactions
from
`project.dataset.ga_sessions_*`,
unnest(hits) as hits
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and hits.isEntrance = True
and totals.visits = 1 #avoid counting midnight-split sessions
group by
1, 2
order by 3 desc
There is still one source of discrepancy though, which comes from the sessions without a landing page (if you check the landing pages report in GA, there will sometimes be a (not set) value).
In order to include those as well, you can do:
with
landing_pages_set as (
select
concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
hits.page.pagePath as virtualPagePath
from
`project.dataset.ga_sessions_*`,
unnest(hits) as hits
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and totals.visits = 1 #avoid counting midnight-split sessions
and hits.isEntrance = TRUE
group by 1, 2
),
landing_pages_not_set as (
select
concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string)) as fullVisitId,
date,
"(not set)" as virtualPagePath,
count(distinct concat(cast(fullVisitorId as string), cast(visitId as string), cast(date as string))) as visits,
sum(totals.bounces) as bounces,
sum(totals.transactions) as transactions
from
`project.dataset.ga_sessions_*`
where
(_table_suffix
between format_date("%Y%m%d", date_sub(current_date(), interval 1 day))
and format_date("%Y%m%d", date_sub(current_date(), interval 1 day)))
and totals.visits = 1 #avoid counting midnight-split sessions
group by 1, 2, 3
),
landing_pages as (
select
l.fullVisitId as fullVisitId,
date,
coalesce(r.virtualPagePath, l.virtualPagePath) as virtualPagePath,
visits,
bounces,
transactions
from
landing_pages_not_set l left join landing_pages_set r on l.fullVisitId = r.fullVisitId
)
select virtualPagePath, sum(visits) from landing_pages group by 1 order by 2 desc

Calculating occupancy level between a date range

I'm having trouble trying to wrap my head around how to write this query to calculate the occupancy level of a hotel and then list the results by date. Consider the following type of data from a table called reservations:
Arrival Departure Guest Confirmation
08/01/2015 08/05/2015 John 13234
08/01/2015 08/03/2015 Bob 34244
08/02/2015 08/03/2015 Steve 32423
08/02/2015 08/02/2015 Mark 32411
08/02/2015 08/04/2015 Jenny 24422
Output Data would ideally look like:
Date Occupancy
08/01/2015 2
08/02/2015 4
08/03/2015 2
08/04/2015 1
08/05/2015 0
And the query should be able to take a date range as a variable. I'm having trouble with the hardest piece: getting the count per night and listing it by date.
You can generate a list of dates first. In Oracle you can do this by using connect by. This will make a recursive query. For instance, to get the next 30 days, you can select today and keep connecting until you've got the desired number of days. level indicates the level of recursion.
select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by level <= 30;
On that list, you can query the number of reservations for each day in that period. A guest occupies a room on THEDATE if they arrived on or before that date and depart after it:
select THEDATE,
(select count(*)
from reservations r
where r.Arrival <= THEDATE and
r.Departure > THEDATE) as RESERVATIONCOUNT
from
( select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by level <= 30)
Instead of a fixed number of dates, you can also use another value there. For instance, to cover at least 30 days into the future, but further if there are reservations later than that:
select THEDATE,
(select count(*)
from reservations r
where r.Arrival <= THEDATE and
r.Departure > THEDATE) as RESERVATIONCOUNT
from
( select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by
level <= greatest(30, (select trunc(max(DEPARTURE) - sysdate)
from reservations)))
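The occupancy rule can be checked against the sample reservations with a short Python sketch. It assumes a guest counts for a date when Arrival <= date < Departure, and that Jenny's departure is 08/04/2015 (the year 2014 in the sample looks like a typo). The function name `occupancy` and the tuple layout are illustrative assumptions.

```python
from datetime import date, timedelta


def occupancy(reservations, start, end):
    """reservations: iterable of (arrival, departure) dates.

    Returns {day: number of guests staying that night} for every day from start to end inclusive.
    """
    result = {}
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        # The departure day itself is not counted as an occupied night.
        result[day] = sum(1 for arrival, departure in reservations if arrival <= day < departure)
    return result


reservations = [
    (date(2015, 8, 1), date(2015, 8, 5)),  # John
    (date(2015, 8, 1), date(2015, 8, 3)),  # Bob
    (date(2015, 8, 2), date(2015, 8, 3)),  # Steve
    (date(2015, 8, 2), date(2015, 8, 2)),  # Mark (same-day arrival and departure, no night counted)
    (date(2015, 8, 2), date(2015, 8, 4)),  # Jenny (departure year assumed to be 2015)
]
levels = occupancy(reservations, date(2015, 8, 1), date(2015, 8, 5))
```

With these inputs the counts for 08/01 through 08/05 are 2, 4, 2, 1 and 0, which matches the output the question asks for.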
