Trying to get a list:
visitorid, time of first visit, time of the hit where a transaction occurred.
What I've written is only grabbing rows that have transaction revenue. I am also trying to convert visitStartTime, which is a Unix timestamp, to a regular date via Date(visitStartTime), but that fails in the GROUP BY because of the output date.
Any direction would be super helpful.
SELECT
fullvisitorID,
visitNumber,
visitStartTime,
hits.transaction.transactionRevenue
FROM
[75718103.ga_sessions_20150310],
[75718103.ga_sessions_20150309],
[75718103.ga_sessions_20150308],
[75718103.ga_sessions_20150307],
[75718103.ga_sessions_20150306],
[75718103.ga_sessions_20150305],
[75718103.ga_sessions_20150304],
[75718103.ga_sessions_20150303],
[75718103.ga_sessions_20150302]
WHERE totals.transactions >=1
GROUP BY
fullvisitorID, visitNumber, visitStartTime, hits.transaction.transactionRevenue;
visitStartTime is defined as POSIX time in the Google Analytics schema, i.e. the number of seconds since the epoch. BigQuery TIMESTAMP is encoded as the number of microseconds since the epoch. Therefore, to get the start time as a TIMESTAMP, I used TIMESTAMP(INTEGER(visitStartTime*1000000)). hits.time contains the number of milliseconds since the first hit, so to get the time of each transaction it needs to be multiplied by 1000 to reach microsecond granularity, hence TIMESTAMP(INTEGER(visitStartTime*1000000 + hits.time*1000)). Since hits is a repeated RECORD, no GROUP BY is necessary; the data model already has all the hits grouped together.
Putting it all together:
SELECT
fullVisitorId,
timestamp(integer(visitStartTime*1000000)) as start_time,
timestamp(integer(visitStartTime*1000000 + hits.time*1000)) as transaction_time
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.transaction.transactionRevenue > 0
Mosha's solution is simple and elegant, but it is actually too simple: it calculates the time between the first pageview and each transaction inside one visit, so it does not calculate the time between a visitor's first visit and their first transaction. If you calculate the average time using Mosha's query it is 1.33 minutes, but using the query I created it is 9.91 minutes. My SQL skills are quite rusty, so it can probably be improved.
Mosha's query (avg. time between the first pageview and each transaction inside one visit):
SELECT ROUND(AVG(MinutesToTransaction),2) AS avgMinutesToTransaction FROM (
SELECT
fullVisitorId,
timestamp(integer(visitStartTime*1000000)) as start_time,
timestamp(integer(visitStartTime*1000000 + hits.time*1000)) as transaction_time,
ROUND((TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000 + hits.time*1000))) - TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000)) )) / 60, 2) AS MinutesToTransaction
FROM
[google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.transaction.transactionRevenue > 0
)
My query (avg. time between the first visit and the first transaction of one visitor):
SELECT ROUND(AVG(MinutesToTransaction),2) AS avgMinutesToTransaction FROM (
SELECT firstInteraction.fullVisitorId,
MIN(firstInteraction.visitId) AS firstInteraction.visitId,
TIMESTAMP(INTEGER(MIN(firstInteraction.visitStartTime)*1000000)) AS timeFirstInteraction,
firstTransaction.visitId,
firstTransaction.timeFirstTransaction,
FIRST(BOOLEAN(firstInteraction.visitId = firstTransaction.visitId)) AS transactionInFirstVisit,
ROUND((TIMESTAMP_TO_SEC(firstTransaction.timeFirstTransaction) - TIMESTAMP_TO_SEC(TIMESTAMP(INTEGER(MIN(firstInteraction.visitStartTime)*1000000)))) / 60, 2) AS MinutesToTransaction
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] firstInteraction
INNER JOIN (
SELECT
fullVisitorId,
visitId,
TIMESTAMP(INTEGER(MIN(visitStartTime*1000000 + hits.time*1000))) AS timeFirstTransaction
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.type = "TRANSACTION"
GROUP BY 1, 2
) AS firstTransaction
ON firstInteraction.fullVisitorId = firstTransaction.fullVisitorId
GROUP BY 1, 4, 5
)
I left in some extra fields, so if you run it without the outer SELECT you can see some interesting data.
PS: Thanks, Mosha, for showing how to calculate the time.
I have a SQLite table representing a chatlog. The two important columns for this question are 'content' and 'timestamp'.
I need to group the messages in the chatlog by conversation. Each message is only an individual line, so a conversation can be selected as each message joined by a newline using group_concat:
group_concat(content, CHAR(10))
I want to identify a conversation by any messages which are within a length of time (such as 15 minutes) from each other. A conversation can be any length (including just an individual message, if there are no other messages within 15 minutes of it).
Knowing this, I can identify whether a message is the start or part of a conversation as
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
But this is as far as I've gotten. I can make a column 'is_new_convo' using
WITH ordered_messages AS (
SELECT content, timestamp
FROM messages
ORDER BY timestamp
), conversations_identified AS (
SELECT *,
CASE
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
THEN 0
ELSE 1
END AS is_new_convo
FROM ordered_messages
) SELECT * FROM conversations_identified
How can I then form a group of messages from where is_new_convo = 1 to the last subsequent is_new_convo = 0?
Here is some sample data and the expected result.
If you take the sum of the is_new_convo column from the start up to a given row, you get the number of times a new conversation has started, which yields an ID shared by all messages in a conversation (since is_new_convo is 0 for messages continuing a conversation, they end up with the same conversation ID). Using this, we can compute the conversation ID for every message and then group by it for group_concat. This doesn't require referencing the original table multiple times, so the WITH clauses aren't needed.
SELECT group_concat(content, CHAR(10)) as conversation
FROM (
SELECT content, timestamp,
SUM(is_new_convo) OVER (ORDER BY timestamp) as conversation_id
FROM (
SELECT content, timestamp,
CASE
WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
THEN 0
ELSE 1
END AS is_new_convo
FROM messages
)
) GROUP BY conversation_id
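To make the technique easy to try end to end, here is a minimal self-contained sketch against a hypothetical messages table (the table contents below are invented purely for illustration; window functions require SQLite 3.25+):
-- Hypothetical sample data, for illustration only
CREATE TABLE messages (content TEXT, timestamp INTEGER);
INSERT INTO messages (content, timestamp) VALUES
  ('hi',              1000),   -- first conversation
  ('hello',           1100),   -- 100 s later, same conversation
  ('gotta go',        1500),   -- still within 15 minutes
  ('back again',     90000),   -- large gap: starts a second conversation
  ('anyone around?', 90200);

-- Same query as above: LAG flags conversation starts, the running SUM
-- turns those flags into a conversation_id, and group_concat stitches
-- each conversation together.
SELECT group_concat(content, CHAR(10)) AS conversation
FROM (
  SELECT content, timestamp,
         SUM(is_new_convo) OVER (ORDER BY timestamp) AS conversation_id
  FROM (
    SELECT content, timestamp,
           CASE
             WHEN timestamp - LAG(timestamp, 1, timestamp) OVER (ORDER BY timestamp) < 900
             THEN 0
             ELSE 1
           END AS is_new_convo
    FROM messages
  )
)
GROUP BY conversation_id;
This returns two rows: the first three messages joined by newlines, and the last two.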
I am stuck with a simple problem: finding out which queries took longer than usual to complete. My script is the following:
locking row for access
SELECT
username,
CollectTimeStamp,
((firstresptime - starttime ) HOUR TO second ) AS ElapsedTime,
((firstresptime - firststeptime ) HOUR TO second ) AS ExecutionTime,
CAST(((firstresptime - firststeptime) SECOND) AS INTEGER) AS ExecutionTimeInt,
(ElapsedTime - ExecutionTime) AS Delay,
-- other kpis here
FROM dbql_data.dbql_all
where username = 'MyUser'
and dateofday> '2017-07-01'
and ExecutionTimeInt > 5
However, I get records having ExecutionTimeInt less than 5.
Question: how can I get the records having a timeinterval greater than a certain value?
Extra info:
select * from dbc.dbcinfo; returns
InfoKey                InfoData
VERSION                15.10.04.10
RELEASE                15.10.04.02
LANGUAGE SUPPORT MODE  Standard
The ExecutionTimeInt calculation is likely to fail with an Interval overflow as it's limited to 9999 seconds.
ElapsedTime is an Interval; the correct way to compare it is:
WHERE ElapsedTime > interval '5' second
or
WHERE ElapsedTime > interval '1' minute
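Putting both points together with the rest of the question's script, a corrected version might look like the sketch below. It assumes, as Teradata permits, that select-list aliases can be referenced later in the query and that dateofday is a DATE column; it also drops the overflow-prone ExecutionTimeInt cast and the stray semicolon after 'MyUser':
LOCKING ROW FOR ACCESS
SELECT
    username,
    CollectTimeStamp,
    ((firstresptime - starttime)     HOUR TO SECOND) AS ElapsedTime,
    ((firstresptime - firststeptime) HOUR TO SECOND) AS ExecutionTime,
    (ElapsedTime - ExecutionTime) AS Delay
    -- other kpis here
FROM dbql_data.dbql_all
WHERE username = 'MyUser'
  AND dateofday > DATE '2017-07-01'
  AND ExecutionTime > INTERVAL '5' SECOND;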
We have been trying to explain why this happened in all of our datasets, but so far we have had no success.
We observed that starting on 18 April our ga_sessions dataset had for the most part duplicated entries (like 99% of rows). As an example, I tested this query:
SELECT
fullvisitorid fv,
visitid v,
ARRAY(
SELECT
AS STRUCT hits.*
FROM
UNNEST(hits) hits
ORDER BY
hits.hitnumber) h
FROM
`dafiti-analytics.40663402.ga_sessions*`
WHERE
1 = 1
AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)) AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
fv,
v
LIMIT
100
And the result was:
We tried to investigate when this began to happen, so I ran this query:
SELECT
date,
f,
COUNT(f) freq from(
SELECT
date,
fullvisitorid fv,
visitid v,
COUNT(CONCAT(fullvisitorid, CAST(visitid AS string))) f
FROM
`dafiti-analytics.40663402.ga_sessions*`
WHERE
1 = 1
AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01')
AND TIMESTAMP('2017-04-30')
GROUP BY
fv,
v,
date )
GROUP BY
f,
date
ORDER BY
date,
freq DESC
And we found that for 3 of our projects it started on 18 April, but in accounts related to LATAM data we only started seeing duplicated rows recently as well.
We also checked whether anything was logged in our GCP Console but couldn't find anything.
Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our Analytics tracking but it seems to be working just fine, and there is no modification we made in these days that would explain it either.
If you need more info please let me know.
Make sure to match only the intraday or non-intraday tables. For intraday:
`dafiti-analytics.40663402.ga_sessions_intraday*`
For non-intraday:
`dafiti-analytics.40663402.ga_sessions_2017*`
The important part is to include enough of the prefix to match the desired tables.
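For example, a sketch of the duplicate check from the question, restricted to the daily (non-intraday) tables, could look like this. The table name is taken from the question; note that with the longer prefix, _TABLE_SUFFIX is only the remaining MMDD part:
#standardSQL
SELECT
  fullvisitorid AS fv,
  visitid AS v,
  COUNT(1) AS rows_per_session
FROM
  `dafiti-analytics.40663402.ga_sessions_2017*`  -- daily tables only; intraday tables do not match this prefix
WHERE
  _TABLE_SUFFIX BETWEEN '0401' AND '0430'        -- April 2017; the year is already in the prefix
GROUP BY fv, v
ORDER BY rows_per_session DESC
LIMIT 100
Any rows_per_session greater than 1 here would point at genuine duplicates rather than intraday/daily overlap.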
I have a 100GB table which I want to process in R. When I export it to CSVs I get 500 CSV files; when I read them into R as data tables and bind them, I get a huge data table which can't be saved or loaded (even when I increase the memory of the virtual instance that R is installed on). I wanted to try a different approach: split the original table, export the pieces, then process each one separately in R. The problem is that I don't want the split to "break" in the middle of a group. For example, my key variable is "visit", and each visit may have several rows. I don't want a visit to be split across different sub-tables (because all my processing in R uses visit as the grouping variable of the data table). What is the best way to do it? I tried to order the visit ids by time, to export only their names to a separate CSV, etc., but all the ORDER BY attempts ended with an error (not enough resources). The table currently contains more than 100M rows, with 64 variables.
I wanted to try a different approach – split the original table …
The problem is that I don't want the split to "break" in the middle of some grouping.
Below is how to identify batches such that rows for the same visitId end up in the same batch.
For each batch the min and max visitId are identified, so you can then use them to extract only the rows whose visitId falls between those values, thus controlling the size of the data to be processed (see the extraction sketch after the standard SQL version below).
1 – Batching by number of rows
Replace 1000000 below with whatever you want the batch size to be, in number of rows.
#legacySQL
SELECT
batch,
SUM(size) AS size,
COUNT(visitId) AS visitids_count,
MIN(visitId) AS visitId_min,
MAX(visitId) AS visitId_max
FROM (
SELECT
visitId,
size,
INTEGER(CEIL(total/1000000)) AS batch
FROM (
SELECT
visitId,
size,
SUM(size) OVER(ORDER BY visitId ) AS total
FROM (
SELECT visitId, COUNT(1) AS size
FROM [yourproject:yourdataset.yourtable]
GROUP BY visitId
)
)
)
GROUP BY batch
2 – Batching by batch size in bytes
Replace 1000000000 below with whatever you want the batch size to be, in bytes.
And replace 123 below with the estimated average size of one row in bytes.
#legacySQL
SELECT
batch,
SUM(size) AS size,
COUNT(visitId) AS visitids_count,
MIN(visitId) AS visitId_min,
MAX(visitId) AS visitId_max
FROM (
SELECT
visitId,
size,
INTEGER(CEIL(total/1000000000)) AS batch
FROM (
SELECT
visitId,
size,
SUM(size) OVER(ORDER BY visitId ) AS total
FROM (
SELECT visitId, SUM(123) AS size
FROM [yourproject:yourdataset.yourtable]
GROUP BY visitId
)
)
)
GROUP BY batch
The above helps you prepare for properly splitting your original table using each batch's min and max values.
Hope this helps you proceed further.
Note: the above assumes a normal distribution of rows per visitId and a relatively large number of rows in the table (as in your example), so the batches will be reasonably evenly sized.
Note 2: I realized I wrote this quickly in legacy SQL, so below is a version in standard SQL in case you want to migrate or are already using it.
#standardSQL
SELECT
batch,
SUM(size) AS size,
COUNT(visitId) AS visitids_count,
MIN(visitId) AS visitId_min,
MAX(visitId) AS visitId_max
FROM (
SELECT
visitId,
size,
CAST(CEIL(total/1000000) as INT64) AS batch
FROM (
SELECT
visitId,
size,
SUM(size) OVER(ORDER BY visitId ) AS total
FROM (
SELECT visitId, COUNT(1) AS size
FROM `yourproject.yourdataset.yourtable`
GROUP BY visitId
)
)
)
GROUP BY batch
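Once you have the batch boundaries from either query above, extracting a single batch for export is just a range filter on visitId. A sketch (the two values are placeholders for the visitId_min and visitId_max returned for the batch you want to pull):
#standardSQL
SELECT *
FROM `yourproject.yourdataset.yourtable`
WHERE visitId BETWEEN 1478300000 AND 1478399999  -- replace with the chosen batch's visitId_min / visitId_max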
Firebase offers split-testing functionality through Firebase Remote Config, but there is no ability to filter retention in the cohorts section by user properties (by any property, in actual fact).
In search of a solution to this problem I'm looking at BigQuery, since Firebase Analytics provides a usable way to export data to that service.
But I'm stuck with many questions, and Google has no answer or example that points me in the right direction.
General questions:
As a first step I need to aggregate data that represents the same data Firebase cohorts do, so I can be sure my calculation is right:
The next step should be simply applying constraints to the queries so they match custom user properties.
Here is what I have so far:
The main problem is a big difference in the user counts: sometimes it is about 100 users, but sometimes close to 1000.
This is the approach I use:
# 1
# Count users with `user_dim.first_open_timestamp_micros`
# in specified period (w0 – week 1)
# this is the way Firebase groups users into cohorts
# (who started app on the same day or during the same week)
# https://support.google.com/firebase/answer/6317510
SELECT
COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
(
TABLE_DATE_RANGE
(
[admob-app-id-xx:xx_IOS.app_events_],
TIMESTAMP('2016-11-20'),
TIMESTAMP('2016-11-26')
)
)
WHERE
STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
BETWEEN '2016-11-20' AND '2016-11-26'
# 2
# For each next period count events with
# same first_open_timestamp
# Here is example for one of the weeks.
# week 0 is Nov20-Nov26, week 1 is Nov27-Dec03
SELECT
COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
(
TABLE_DATE_RANGE
(
[admob-app-id-xx:xx_IOS.app_events_],
TIMESTAMP('2016-11-27'),
TIMESTAMP('2016-12-03')
)
)
WHERE
STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
BETWEEN '2016-11-20' AND '2016-11-26'
# 3
# Now we have users for each week w1, w2, ... w5
# Calculate retention for each of them
# retention week 1 = w1 / w0 * 100 = 25.72181359
# rw2 = w2 / w0 * 100
# ...
# rw5 = w5 / w0 * 100
# 4
# Shift week 0 by one and repeat from step 1
BigQuery queries tips request
Any tips or directions on building a complex query that can aggregate and calculate all the data required for this task in one step would be very appreciated.
Here is the BigQuery Export schema, if needed.
Side questions:
Why are all the user_dim.device_info.device_id and user_dim.device_info.resettable_device_id values null?
user_dim.app_info.app_id is missing from the doc (in case someone from Firebase support reads this question).
How should event_dim.timestamp_micros and event_dim.previous_timestamp_micros be used? I cannot work out their purpose.
PS
It would be good if someone from the Firebase team answered this question. Five months ago there was a mention of extending the cohorts functionality with filtering, or of showing BigQuery examples, but things are not moving. Firebase Analytics is the way to go, they said; Google Analytics is deprecated, they said.
I have now spent a second day learning BigQuery and building my own solution on top of the existing analytics tools. I know Stack Overflow is not the place for these comments, but guys, what are you thinking? Split testing may dramatically affect retention of my app. My app does not sell anything; funnels and events are not valuable metrics in many cases.
Any tips or directions on building a complex query that can aggregate and calculate all the data required for this task in one step would be very appreciated.
Yes, a generic BigQuery approach will work fine here.
Below is not the most generic version, but it can give you an idea.
In this example I am using the Stack Overflow data available in the Google BigQuery public datasets.
The first sub-select – activities – is in most cases the only part you need to rewrite to reflect the specifics of your data.
What it does is:
a. Defines the period you want to use for the analysis.
In the example below it is a month – FORMAT_DATE('%Y-%m', ...
But you can use year, week, day or anything else – respectively:
• By year - FORMAT_DATE('%Y', DATE(answers.creation_date)) AS period
• By week - FORMAT_DATE('%Y-%W', DATE(answers.creation_date)) AS period
• By day - FORMAT_DATE('%Y-%m-%d', DATE(answers.creation_date)) AS period
• …
b. It also "filters" only the type of events/activity you need to analyse.
For example, WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' looks for answers to questions tagged google-bigquery.
The rest of the sub-queries are more or less generic and can mostly be used as is.
#standardSQL
WITH activities AS (
SELECT answers.owner_user_id AS id,
FORMAT_DATE('%Y-%m', DATE(answers.creation_date)) AS period
FROM `bigquery-public-data.stackoverflow.posts_answers` AS answers
JOIN `bigquery-public-data.stackoverflow.posts_questions` AS questions
ON questions.id = answers.parent_id
WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
GROUP BY id, period
), cohorts AS (
SELECT id, MIN(period) AS cohort FROM activities GROUP BY id
), periods AS (
SELECT period, ROW_NUMBER() OVER(ORDER BY period) AS num
FROM (SELECT DISTINCT cohort AS period FROM cohorts)
), cohorts_size AS (
SELECT cohort, periods.num AS num, COUNT(DISTINCT activities.id) AS ids
FROM cohorts JOIN activities ON activities.period = cohorts.cohort AND cohorts.id = activities.id
JOIN periods ON periods.period = cohorts.cohort
GROUP BY cohort, num
), retention AS (
SELECT cohort, activities.period AS period, periods.num AS num, COUNT(DISTINCT cohorts.id) AS ids
FROM periods JOIN activities ON activities.period = periods.period
JOIN cohorts ON cohorts.id = activities.id
GROUP BY cohort, period, num
)
SELECT
CONCAT(cohorts_size.cohort, ' - ', FORMAT("%'d", cohorts_size.ids), ' users') AS cohort,
retention.num - cohorts_size.num AS period_lag,
retention.period as period_label,
ROUND(retention.ids / cohorts_size.ids * 100, 2) AS retention, retention.ids AS rids
FROM retention
JOIN cohorts_size ON cohorts_size.cohort = retention.cohort
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2015-01-01'))
ORDER BY cohort, period_lag, period_label
You can visualize the result of the above query with the tool of your choice.
Note: you can use either period_lag or period_label.
See the difference of their use in the examples below:
with period_lag
with period_label