I'm trying to set up rolling 7-day users and rolling 31-day users in BigQuery (with Firebase) using the following query. For each day, I want it to examine the previous 7 days as well as the previous 31 days. I've been stuck on this error message:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
The query:
With events AS (
SELECT PARSE_DATE("%Y%m%d", event_date) as event_date, user_pseudo_id FROM `my_data_table.analytics_178206500.events_*`
Where _table_suffix NOT LIKE "i%" AND event_name = "user_engagement"
GROUP BY 1, 2
),
DAU AS (
SELECT event_date as date, COUNT(DISTINCT(user_pseudo_id)) AS dau
From events
GROUP BY 1
)
SELECT DAU.date, DAU.dau,
(
SELECT count(distinct(user_pseudo_id))
FROM events
WHERE events.event_date BETWEEN DATE_SUB(DAU.date, INTERVAL 29 DAY) and dau.date
) as mau,
(
SELECT count(distinct(user_pseudo_id))
FROM events
WHERE events.event_date BETWEEN DATE_SUB(DAU.date, INTERVAL 7 DAY) and dau.date
) as wau
FROM DAU
ORDER BY 1 DESC
I'm able to get the DAU part, but the last-7-day users (WAU) and last-31-day users (MAU) aren't coming through. I have tried to CROSS JOIN DAU with events, but I get the following results: [image: GraphResults]
Any pointers would be greatly appreciated
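For reference, the intended windowing can be sketched outside of SQL. This is a minimal Python sketch, not BigQuery code; the function name `rolling_active_users` and the sample data are made up for illustration. It counts distinct users over an inclusive N-day window ending on each date, which means looking back N - 1 days.

```python
from datetime import date, timedelta

def rolling_active_users(events, window_days):
    """Count distinct users per date over an inclusive trailing window.

    events: iterable of (event_date, user_pseudo_id) pairs.
    window_days: window length in days, including the date itself.
    """
    users_by_day = {}
    for day, user in events:
        users_by_day.setdefault(day, set()).add(user)

    result = {}
    for day in sorted(users_by_day):
        active = set()
        # Offsets 0 .. window_days - 1 give a window of exactly window_days days.
        for offset in range(window_days):
            active |= users_by_day.get(day - timedelta(days=offset), set())
        result[day] = len(active)
    return result

# Hypothetical sample data: (event_date, user_pseudo_id)
sample_events = [
    (date(2021, 1, 1), "a"),
    (date(2021, 1, 2), "b"),
    (date(2021, 1, 8), "a"),
    (date(2021, 1, 8), "c"),
]

dau = rolling_active_users(sample_events, 1)
wau = rolling_active_users(sample_events, 7)
mau = rolling_active_users(sample_events, 31)
```

Because BETWEEN is inclusive on both ends, a 7-day window in the query corresponds to INTERVAL 6 DAY and a 31-day window to INTERVAL 30 DAY; the 7 DAY and 29 DAY intervals in the query as posted cover 8 and 30 days.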
I have two timestamps, #starttimestamp and #endtimestamp. How do I calculate the number of working hours between them?
Working hours are defined as follows:
Mon-Thursday (9:00-17:00)
Friday (9:00-13:00)
This has to work in Impala.
I think I found a better solution.
We will create a series of numbers using a large table. You could use a time-dimension table instead. Make sure the series doesn't get truncated (the table needs at least as many rows as there are days in your longest range). I am using a large table from my DB.
Use this series to generate a range of dates between the start and end date.
date_add (t.start_date,rs.uniqueid) -- create range of dates
join (select row_number() over ( order by mycol) as uniqueid -- create range of unique ids
from largetab) rs
where end_date >=date_add (t.start_date,rs.uniqueid)
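Then we calculate the total difference in hours between the two timestamps, using Unix timestamps so that both date and time are taken into account.
(unix_timestamp(endtimestamp) - unix_timestamp(starttimestamp)) / 3600
Then subtract the non-working hours for each day in the range: 16 hours on Mon-Thu, 20 hours on Friday, 24 hours on Sat-Sun. Note that dayofweek() in Impala returns 1 for Sunday through 7 for Saturday, so Friday is 6.
case when dayofweek ( dday) in (1,7) then 24
when dayofweek ( dday) =6 then 20
else 16 end as non_work_hours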
Here is complete SQL.
select
end_date, start_date,
-- dayofweek() returns 1 = Sunday ... 7 = Saturday, so Friday is 6
diff_in_hr - sum(case when dayofweek ( dday) in (1,7) then 24
when dayofweek ( dday) =6 then 20
else 16 end ) total_workhrs
from (
select (unix_timestamp(end_date)- unix_timestamp(start_date))/3600 as diff_in_hr , end_date, start_date,date_add (t.start_date,rs.uniqueid) as dDay
from tdate t
join (select row_number() over ( order by mycol) as uniqueid from largetab) rs
where end_date >=date_add (t.start_date,rs.uniqueid)
)rs2
group by 1,2,diff_in_hr
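As a cross-check for the SQL above, here is a minimal Python sketch of a different technique: instead of subtracting non-working hours, it clips the interval against each day's working window and sums the overlaps. The name `working_hours` and the `WORK_WINDOWS` table are illustrative, and the inputs are assumed to be naive datetimes in a single time zone.

```python
from datetime import datetime, time, timedelta

# Working windows by weekday (Monday == 0), taken from the question.
WORK_WINDOWS = {
    0: (time(9), time(17)),
    1: (time(9), time(17)),
    2: (time(9), time(17)),
    3: (time(9), time(17)),
    4: (time(9), time(13)),  # Friday
}

def working_hours(start, end):
    """Return working hours between two datetimes by clipping each day's window."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        window = WORK_WINDOWS.get(day.weekday())  # None on Saturday and Sunday
        if window is not None:
            window_start = datetime.combine(day, window[0])
            window_end = datetime.combine(day, window[1])
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600
```

This handles intervals that start or end in the middle of a working day, which is useful for checking the SQL result on a handful of rows.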
I have the following data in BigQuery:
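| date | fullVisitorId | sessionId | hitNumber | type | url | eventCategory | eventAction | eventLabel |
|----------|--------------------|--------------------------------------|---|-------|-----------------------|----------|------|--------------|
| 20210101 | 973454546035798949 | 973454546035798949162783837520210101 | 1 | PAGE | homepage.com | Null | Null | Null |
| 20210101 | 973454546035798949 | 973454546035798949162783837520210101 | 2 | EVENT | homepage.com/purchase | View | Book | Harry_Potter |
| 20210101 | 973454546035798949 | 973454546035798949162783837520210101 | 3 | EVENT | homepage.com/purchase | Purchase | Book | Harry_Potter |
...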
I want to create a conversion funnel based on URLs and events, not necessarily sequential. For example, I want to calculate the number of distinct users (fullVisitorId) and the number of distinct sessions (sessionId) in which:
Users visited the homepage (homepage.com).
Then the event with Category View, Action Book and Label Harry_Potter was triggered,
Then the event with Category Purchase, Action Book and Label Harry_Potter was triggered.
Again the hits are not necessarily sequential, which means that the hit numbers could be 1, 4, and 8, respectively, for these 3 steps. Also, the real number of desired steps is more than 10.
Ideally, the final results should look like this:
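| Type | Date | Step 1 | Step 2 | Step 3 | Step 4 |
|----------|------------|--------|--------|--------|--------|
| Users | 01/01/2021 | 120 | 110 | 90 | ... |
| Users | 02/01/2021 | 130 | 80 | 70 | ... |
| Sessions | 01/01/2021 | 200 | 120 | 100 | ... |
| Sessions | 02/01/2021 | 220 | 80 | 70 | ... |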
where Step 1, Step 2, and Step 3 represent the number of users and sessions in which the particular step was done.
Any ideas? Thanks!
You will have to do something like the SQL below. You can write one CTE per condition and then join them.
WITH STEP1 AS
(
SELECT fullVisitorId, Date, SUM(hitNumber) AS STEP1
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE STARTS_WITH(url, "homepage.com") AND fullVisitorId IS NOT NULL
GROUP BY fullVisitorId, Date
),
STEP2 AS
(
SELECT fullVisitorId, Date, SUM(hitNumber) AS STEP2
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE eventCategory = 'View' AND eventAction = 'Book' AND eventLabel = 'Harry_Potter' AND fullVisitorId IS NOT NULL
GROUP BY fullVisitorId, Date
),
STEP3 AS
(
SELECT fullVisitorId, Date, SUM(hitNumber) AS STEP3
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE eventCategory = 'Purchase' AND eventAction = 'Book' AND eventLabel = 'Harry_Potter' AND fullVisitorId IS NOT NULL
GROUP BY fullVisitorId, Date
),
JOINED_DATA AS
(
SELECT 'Users' AS Type,
coalesce(SUB_QUERY.Date,STEP3.Date) AS Date,
SUB_QUERY.STEP1,
SUB_QUERY.STEP2,
STEP3.STEP3
FROM STEP3 FULL OUTER JOIN
(
SELECT coalesce(STEP1.fullVisitorId,STEP2.fullVisitorId) AS fullVisitorId,
coalesce(STEP1.Date,STEP2.Date) AS Date,
STEP1.STEP1, -- carried through so the outer query can reference it
STEP2.STEP2
FROM STEP1 FULL OUTER JOIN STEP2
ON STEP1.DATE = STEP2.DATE AND STEP1.fullVisitorId = STEP2.fullVisitorId
) AS SUB_QUERY
ON STEP3.fullVisitorId = SUB_QUERY.fullVisitorId AND STEP3.Date = SUB_QUERY.Date
)
SELECT * FROM JOINED_DATA
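To pin down the counting logic the question asks for, here is a minimal Python sketch. It rests on an assumption not stated in the thread: the funnel is evaluated per session, and a step only counts if all earlier steps matched earlier hits of that session (ordered by hitNumber, not necessarily adjacent). The names `funnel_counts` and `make_hit`, and the second sample session, are hypothetical.

```python
from collections import defaultdict

def funnel_counts(hits, steps):
    """Count distinct users and sessions reaching each funnel step, per date."""
    sessions = defaultdict(list)
    for hit in hits:
        key = (hit["date"], hit["fullVisitorId"], hit["sessionId"])
        sessions[key].append(hit)

    users = defaultdict(lambda: [set() for _ in steps])
    session_counts = defaultdict(lambda: [0] * len(steps))
    for (day, visitor, _session), session_hits in sessions.items():
        reached = 0
        # Walk the hits in order; each hit can advance the funnel by one step.
        for hit in sorted(session_hits, key=lambda h: h["hitNumber"]):
            if reached < len(steps) and steps[reached](hit):
                reached += 1
        day_users, day_sessions = users[day], session_counts[day]
        for k in range(reached):
            day_users[k].add(visitor)
            day_sessions[k] += 1

    return {
        day: {"Users": [len(s) for s in users[day]],
              "Sessions": list(session_counts[day])}
        for day in session_counts
    }

def make_hit(day, visitor, session, number, hit_type, url,
             category=None, action=None, label=None):
    """Build one hit row as a dict (helper for the sample data)."""
    return {"date": day, "fullVisitorId": visitor, "sessionId": session,
            "hitNumber": number, "type": hit_type, "url": url,
            "eventCategory": category, "eventAction": action, "eventLabel": label}

def is_event(category, action, label):
    """Return a predicate matching an event's category, action and label."""
    return lambda h: (h["eventCategory"], h["eventAction"], h["eventLabel"]) == (
        category, action, label)

steps = [
    lambda h: h["type"] == "PAGE" and h["url"] == "homepage.com",
    is_event("View", "Book", "Harry_Potter"),
    is_event("Purchase", "Book", "Harry_Potter"),
]

visitor = "973454546035798949"
session = "973454546035798949162783837520210101"
hits = [
    make_hit("20210101", visitor, session, 1, "PAGE", "homepage.com"),
    make_hit("20210101", visitor, session, 2, "EVENT", "homepage.com/purchase",
             "View", "Book", "Harry_Potter"),
    make_hit("20210101", visitor, session, 3, "EVENT", "homepage.com/purchase",
             "Purchase", "Book", "Harry_Potter"),
    # Hypothetical second session: purchase without a preceding view.
    make_hit("20210101", "111", "s2", 1, "PAGE", "homepage.com"),
    make_hit("20210101", "111", "s2", 2, "EVENT", "homepage.com/purchase",
             "Purchase", "Book", "Harry_Potter"),
]

result = funnel_counts(hits, steps)
```

Adding more steps only means appending predicates to `steps`, which may be easier to reason about than one CTE per step when there are more than 10 of them.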
We've used the following query to inspect visitNumber over time and found that for a particular fullVisitorId they can have more than one 'first' visit.
select
count(distinct fullVisitorId) as users,
newVisits
From(
select fullVisitorId, visitNumber, count(distinct visitId) as newVisits
from table_date_range([91311726.ga_sessions_], timestamp('20151101'), timestamp('20161124') )
where visitNumber = 1
group by fullVisitorId, visitNumber )
group by newVisits;
Result:
| users | newVisits |
|-----------|------------|
| 18 | 3 |
| 26041561 | 1 |
| 237792 | 2 |
My understanding is that for Universal Analytics the visitNumber is a counter on the Google Analytics backend that increments for each new session per fullVisitorId, so how is it possible to have more than one session with visitNumber = 1?
There are 2 main causes for this:
Visits spanning a day boundary. Say a visit starts on 20151101 at 11:45pm and lasts until 20151102 at 1:00am. This can create 2 different sessions, but the visitNumber won't be incremented.
If a user's last session was over 183 days ago, they will be considered a new user and their visitNumber will reset to 1. The reason is that Analytics has to look back to find the last session in order to increment the visitNumber, and the maximum lookback is 183 days. So a user who visited on 20151101 and only came back on 20160701 would have visitNumber = 1 on both visits.
I've got Google Analytics running on a website and am now trying to determine the conversion rate in certain time intervals. I therefore have a table that contains
interval_id
interval_start_time_utc
interval_end_time_utc
Sadly, the following BigQuery query that would assign each order to an interval will not work:
SELECT
totals.transactions,
totals.visits,
i.interval_id
FROM [123456.ga_sessions_20160609]
INNER JOIN intervals i ON i.interval_start_time_utc < visitStartTime AND visitStartTime < i.interval_end_time_utc
This throws the error
ON clause must be AND of = comparisons of one field name from each table [...]
so I gather that BigQuery simply doesn't do range joins. Is there another way to do this short of doing a full join and then paring down? Are there entirely different, better approaches for this sort of thing?
BigQuery Standard SQL doesn't have this limitation; see Enabling Standard SQL.
If you want to do this in BigQuery Legacy SQL, try something like the following:
SELECT
totals.transactions,
totals.visits,
i.interval_id
FROM [123456.ga_sessions_20160609]
CROSS JOIN intervals i
WHERE i.interval_start_time_utc < visitStartTime
AND visitStartTime < i.interval_end_time_utc
To present the idea, let's simplify the example.
And remember: we want to do this in BigQuery Legacy SQL, not in Standard SQL, where it is trivial!
Challenge
Assume we have visits table:
SELECT visit_time FROM
(SELECT 2 AS visit_time),
(SELECT 12 AS visit_time),
(SELECT 22 AS visit_time),
(SELECT 32 AS visit_time)
and intervals table:
SELECT before, after, event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
We want to extract all visits which are within event’s before and after values
This can be simply done with use of CROSS JOIN like below:
SELECT
visit_time, event, before, after
FROM (
SELECT visit_time FROM
(SELECT 2 AS visit_time),
(SELECT 12 AS visit_time),
(SELECT 22 AS visit_time),
(SELECT 32 AS visit_time)
) AS visits
CROSS JOIN (
SELECT before, after, event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
) AS intervals
WHERE visit_time BETWEEN before AND after
With result as:
visit_time event before after
2 3 1 5
22 23 21 25
Potential Issue
When both tables are big enough – this cross join becomes quite expensive!
Hint
It turned out (from the user's comments) that the intervals always extend x units to the left and right of the event.
Solution
Below is a proposed solution that uses this fact and relies on a regular JOIN instead of a CROSS JOIN between the two big tables.
The key is to generate, on the fly, a new table that holds every possible point of each interval, based on event and x:
SELECT event, event + delta AS point
FROM (
SELECT event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
) AS events
CROSS JOIN (
SELECT pos - 1 - 2 AS delta FROM (
SELECT ROW_NUMBER() OVER() AS pos, * FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + 2 * 2, '.'),'') AS h FROM (SELECT NULL)),h
)))
) AS deltas
In above code x = 2 – but you can change it in two places, for example if x = 5 you should have
SELECT pos - 1 - 5 AS delta FROM (
SELECT ROW_NUMBER() OVER() AS pos, * FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + 2 * 5, '.'),'') AS h FROM (SELECT NULL)),h
)))
The CROSS JOIN in the above code is inexpensive because the deltas table is quite small (2x + 1 rows).
So, finally, you can get your result with the query below:
SELECT
visit_time, event
FROM (
SELECT visit_time FROM
(SELECT 2 AS visit_time),
(SELECT 12 AS visit_time),
(SELECT 22 AS visit_time),
(SELECT 32 AS visit_time)
) AS visits
JOIN (
SELECT event, event + delta AS point
FROM (
SELECT event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
) AS events
CROSS JOIN (
SELECT pos - 1 - 2 AS delta FROM (
SELECT ROW_NUMBER() OVER() AS pos, * FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + 2 * 2, '.'),'') AS h FROM (SELECT NULL)),h
)))
) AS deltas
) AS points
ON points.point = visits.visit_time
With expected result
visit_time event
2 3
22 23
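The same idea can be sketched in Python to check that both approaches agree. This is an illustration rather than BigQuery code: it assumes integer times, and the function names `match_cross` and `match_points` are made up. The first function mirrors the CROSS JOIN with a filter; the second expands each event into its 2x + 1 points and then looks visits up by equality, which is what the JOIN on points.point = visits.visit_time does.

```python
from collections import defaultdict

def match_cross(visits, events, x):
    """Baseline: compare every visit with every event (the CROSS JOIN approach)."""
    return sorted(
        (visit, event)
        for visit in visits
        for event in events
        if event - x <= visit <= event + x
    )

def match_points(visits, events, x):
    """Expand each event into its 2x + 1 points, then equi-join on the point."""
    events_by_point = defaultdict(list)
    for event in events:
        for delta in range(-x, x + 1):
            events_by_point[event + delta].append(event)
    return sorted(
        (visit, event)
        for visit in visits
        for event in events_by_point.get(visit, [])
    )

# Same sample values as in the queries above.
visits = [2, 12, 22, 32]
events = [3, 8, 23, 35]

cross_result = match_cross(visits, events, 2)
points_result = match_points(visits, events, 2)
```

The expansion costs (2x + 1) rows per event, so it pays off when x is small compared with the number of visits.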
I think the above approach can work for you, but you will need to adapt it to your particular case.
I think this can be done relatively easily if you round all the times involved to whole minutes.
Hope this helps.
Share the result with us if you get this to work :o)
I'm having trouble trying to wrap my head around how to write this query to calculate the occupancy level of a hotel and then list the results by date. Consider the following type of data from a table called reservations:
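| Arrival | Departure | Guest | Confirmation |
|------------|------------|-------|--------------|
| 08/01/2015 | 08/05/2015 | John | 13234 |
| 08/01/2015 | 08/03/2015 | Bob | 34244 |
| 08/02/2015 | 08/03/2015 | Steve | 32423 |
| 08/02/2015 | 08/02/2015 | Mark | 32411 |
| 08/02/2015 | 08/04/2015 | Jenny | 24422 |
Output data would ideally look like:
| Date | Occupancy |
|------------|-----------|
| 08/01/2015 | 2 |
| 08/02/2015 | 4 |
| 08/03/2015 | 2 |
| 08/04/2015 | 1 |
| 08/05/2015 | 0 |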
The query should also be able to take a date range as a parameter. The hardest part for me is getting the count per night and listing it by date.
You can generate a list of dates first. In Oracle you can do this by using connect by. This will make a recursive query. For instance, to get the next 30 days, you can select today and keep connecting until you've got the desired number of days. level indicates the level of recursion.
select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by level <= 30;
On that list, you can query the number of reservations for each day in that period:
select THEDATE,
(select count(*)
from reservations r
-- a guest occupies the night of THEDATE when Arrival <= THEDATE < Departure
where r.Arrival <= THEDATE and
r.Departure > THEDATE) as RESERVATIONCOUNT
from
( select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by level <= 30)
Instead of a fixed number of dates, you can also compute the limit, for instance to cover at least 30 days into the future, but further if there are reservations that end later:
select THEDATE,
(select count(*)
from reservations r
-- a guest occupies the night of THEDATE when Arrival <= THEDATE < Departure
where r.Arrival <= THEDATE and
r.Departure > THEDATE) as RESERVATIONCOUNT
from
( select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by
level <= greatest(30, (select trunc(max(DEPARTURE) - sysdate)
from reservations)))
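The occupancy rule can be checked against the sample data from the question with a short Python sketch. The function name `occupancy_by_date` is made up for illustration, and the rule that the departure day is not counted is an assumption inferred from the expected output in the question.

```python
from datetime import date, timedelta

def occupancy_by_date(reservations, first_day, last_day):
    """Count reservations occupying each night in [first_day, last_day].

    A guest occupies a night when arrival <= night < departure, so the
    departure day itself is not counted.
    """
    counts = {}
    day = first_day
    while day <= last_day:
        counts[day] = sum(
            1 for arrival, departure in reservations if arrival <= day < departure
        )
        day += timedelta(days=1)
    return counts

# (arrival, departure) pairs from the question's sample table.
reservations = [
    (date(2015, 8, 1), date(2015, 8, 5)),  # John
    (date(2015, 8, 1), date(2015, 8, 3)),  # Bob
    (date(2015, 8, 2), date(2015, 8, 3)),  # Steve
    (date(2015, 8, 2), date(2015, 8, 2)),  # Mark (same-day, occupies no night)
    (date(2015, 8, 2), date(2015, 8, 4)),  # Jenny (departure taken as 08/04/2015)
]

occupancy = occupancy_by_date(reservations, date(2015, 8, 1), date(2015, 8, 5))
```

The loop over days plays the role of the connect by date list, and the generator expression plays the role of the correlated count(*) subquery.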