BigQuery GA Export with Duplicated Rows - google-analytics

We have been trying to explain why this happened in all of our datasets, but so far we have had no success.
We observed that starting on 18 April our ga_sessions dataset had mostly duplicated entries (about 99% of rows). As an example, I tested this query:
SELECT
  fullvisitorid fv,
  visitid v,
  ARRAY(
    SELECT AS STRUCT hits.*
    FROM UNNEST(hits) hits
    ORDER BY hits.hitnumber) h
FROM
  `dafiti-analytics.40663402.ga_sessions*`
WHERE
  1 = 1
  AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)) AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
  fv, v
LIMIT 100
And the result showed the duplicated rows (screenshot omitted).
We tried to investigate when this began to happen, so I ran this query:
SELECT
  date,
  f,
  COUNT(f) freq
FROM (
  SELECT
    date,
    fullvisitorid fv,
    visitid v,
    COUNT(CONCAT(fullvisitorid, CAST(visitid AS STRING))) f
  FROM
    `dafiti-analytics.40663402.ga_sessions*`
  WHERE
    1 = 1
    AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01') AND TIMESTAMP('2017-04-30')
  GROUP BY
    fv, v, date )
GROUP BY
  f, date
ORDER BY
  date, freq DESC
And we found that for three of our projects it started on 18 April, but in accounts related to LATAM data we started seeing duplicated rows only recently as well.
We also checked whether anything was logged in our GCP Console, but couldn't find anything.
Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our analytics tracking and it seems to be working just fine, and we made no changes in those days that would explain it either.
If you need more info please let me know.

Make sure to match only the intraday or non-intraday tables. For intraday:
`dafiti-analytics.40663402.ga_sessions_intraday*`
For non-intraday:
`dafiti-analytics.40663402.ga_sessions_2017*`
The important part is to include enough of the prefix to match the desired tables.
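For example, applying the non-intraday form to the first query's date filter might look like this (a sketch; note that with the longer prefix, _table_suffix contains only the remaining "MMDD" part of the table name):
SELECT
  fullvisitorid fv,
  visitid v
FROM
  `dafiti-analytics.40663402.ga_sessions_2017*`
WHERE
  _table_suffix = FORMAT_DATE('%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
Because intraday tables are named ga_sessions_intraday_YYYYMMDD, the longer prefix excludes them automatically.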

Related

COUNT(totals.visits) - is it an accurate measure of sessions?

I am trying to write a query in Google BigQuery, where our GA data is exported. The query is below:
SELECT visitStartTime, date, hits.eCommerceAction.*, COUNT(totals.visits)
FROM FLATTEN([bigquery-xxxxxx:xxxxxxxx.ga_sessions_20180925], hits.eCommerceAction)
WHERE hits.eCommerceAction.action_type <> '0'
GROUP BY date, visitStartTime, hits.eCommerceAction.action_type, hits.eCommerceAction.option, hits.eCommerceAction.step
LIMIT 1000
The output from this looks something like this:
date      hits_type  hits_step  hits_option  f0_
20180925  5          1          1            0
20180925  2          1          0            1
My question is: when there is an ecommerce hit being sent, how can the session count be 0 (the f0_ column)? Since totals.visits can return 1 or NULL, and since COUNT only counts non-NULL values, should I be counting some other field like visitId to avoid NULLs? All the tutorials online use totals.visits, so I am confused about whether I am missing something here.
Thanks
If there are only non-interaction hits in the session, totals.visits will be NULL. If you want to include both interaction and non-interaction hits, then it's correct to count unique visitId + fullVisitorId combinations.
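A minimal sketch of that count in Standard SQL (the table name is a placeholder for the export table from the question):
SELECT
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS sessions
FROM
  `project.dataset.ga_sessions_20180925`
This counts every session that sent at least one hit, whether interactive or not.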

Refresh Google Analytics BigQuery Export

Is it possible to refresh the Google Analytics BigQuery Export? I'm currently getting double the amount of bounces I should be on one day, and I have no idea why (it's not doubled in GA).
Thanks
This might be happening because your wildcard selection ended up querying the "ga_sessions" and "intraday" tables at the same time.
Sometimes the ga_sessions table is created but the intraday table is not deleted, which results in your wildcard selecting both tables.
I usually add this condition to my WHERE clause in order to select only one of the tables, like so:
FROM `dataset_id.ga_sessions*`
WHERE
  1 = 1
  AND CASE
    WHEN (REGEXP_CONTAINS(_table_suffix, 'intraday') AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN "20170601" AND "20170602") THEN TRUE
    WHEN (NOT REGEXP_CONTAINS(_table_suffix, 'intraday') AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN "20170525" AND "20170531") THEN TRUE
  END
If you want to select from the previous X days until today, this might work (just replace X with how many days you want to go back in time, 30 days for instance):
WHERE
  1 = 1
  AND CASE
    WHEN (REGEXP_CONTAINS(_table_suffix, 'intraday') AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 0 DAY))) THEN TRUE
    WHEN (NOT REGEXP_CONTAINS(_table_suffix, 'intraday') AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL X DAY)) AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))) THEN TRUE
  END
For yesterday and today I query the intraday table; for older dates I scan only the consolidated ga_sessions tables.
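To confirm whether a leftover intraday table is overlapping a consolidated one, listing the matching suffixes side by side can help (a sketch):
SELECT
  _table_suffix AS suffix,
  COUNT(1) AS row_count
FROM
  `dataset_id.ga_sessions*`
GROUP BY suffix
ORDER BY suffix
If both an "_intraday_YYYYMMDD" and a plain "_YYYYMMDD" suffix show up for the same date, the wildcard is reading that day twice.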

Firebase exported to BigQuery: retention cohorts query

Firebase offers split-testing functionality through Firebase Remote Config, but the cohorts section lacks the ability to filter retention by user properties (by any property, in fact).
In search of a solution to this problem I'm looking at BigQuery, since Firebase Analytics provides a usable way to export data to that service.
But I'm stuck with many questions, and Google has no answer or example that might point me in the right direction.
General questions:
As a first step I need to aggregate data that reproduces what Firebase cohorts show, so I can be sure my calculation is right.
The next step should be to simply apply constraints to the queries so they match custom user properties.
Here is what I have so far:
The main problem is a big difference in the user counts. Sometimes it is about 100 users, but sometimes close to 1000.
This is the approach I use:
# 1
# Count users with `user_dim.first_open_timestamp_micros`
# in the specified period (w0 – week 0).
# This is the way Firebase groups users into cohorts
# (those who started the app on the same day or during the same week).
# https://support.google.com/firebase/answer/6317510
SELECT
  COUNT(DISTINCT user_dim.app_info.app_instance_id) AS count
FROM
  (
    TABLE_DATE_RANGE(
      [admob-app-id-xx:xx_IOS.app_events_],
      TIMESTAMP('2016-11-20'),
      TIMESTAMP('2016-11-26'))
  )
WHERE
  STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
    BETWEEN '2016-11-20' AND '2016-11-26'

# 2
# For each following period, count users with the
# same first_open_timestamp range.
# Here is the example for one of the weeks:
# week 0 is Nov 20–Nov 26, week 1 is Nov 27–Dec 03.
SELECT
  COUNT(DISTINCT user_dim.app_info.app_instance_id) AS count
FROM
  (
    TABLE_DATE_RANGE(
      [admob-app-id-xx:xx_IOS.app_events_],
      TIMESTAMP('2016-11-27'),
      TIMESTAMP('2016-12-03'))
  )
WHERE
  STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
    BETWEEN '2016-11-20' AND '2016-11-26'

# 3
# Now we have user counts for each week w1, w2, ... w5.
# Calculate retention for each of them relative to the cohort size w0:
# retention week 1 = w1 / w0 * 100 = 25.72181359
# rw2 = w2 / w0 * 100
# ...
# rw5 = w5 / w0 * 100

# 4
# Shift week 0 by one and repeat from step 1.
BigQuery query tips request
Any tips and directions for building a complex query that can aggregate and calculate all the data required for this task in one step are very much appreciated.
Here is the BigQuery Export schema if needed.
Side questions:
why are all the user_dim.device_info.device_id and user_dim.device_info.resettable_device_id values null?
user_dim.app_info.app_id is missing from the doc (in case a Firebase support teammate reads this question)
how should event_dim.timestamp_micros and event_dim.previous_timestamp_micros be used? I cannot work out their purpose.
PS
It would be good if someone from the Firebase team answered this question. Five months ago there was a mention of extending the cohorts functionality with filtering, or of showing BigQuery examples, but things are not moving. Firebase Analytics is the way to go, they said; Google Analytics is deprecated, they said.
Now I'm spending a second day learning BigQuery to build my own solution on top of the existing analytics tools. I know Stack Overflow is not the place for such comments, but guys, what are you thinking? Split testing may dramatically affect retention of my app. My app doesn't sell anything; funnels and events are not valuable metrics in many cases.
Any tips and directions for building a complex query that can aggregate and calculate all the data required for this task in one step are very much appreciated.
Yes, generic BigQuery will work fine.
Below is not the most generic version, but it can give you an idea.
In this example I am using the Stack Overflow data available in Google BigQuery Public Datasets.
The first sub-select – activities – is in most cases the only one you need to rewrite to reflect the specifics of your data.
What it does:
a. Defines the period you want to use for the analysis.
In the example below it is a month – FORMAT_DATE('%Y-%m', ...
But you can use year, week, day or anything else, respectively:
• By year – FORMAT_DATE('%Y', DATE(answers.creation_date)) AS period
• By week – FORMAT_DATE('%Y-%W', DATE(answers.creation_date)) AS period
• By day – FORMAT_DATE('%Y-%m-%d', DATE(answers.creation_date)) AS period
• …
b. It also "filters" only the type of events/activity you need to analyse;
for example, WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' looks for answers to questions tagged google-bigquery.
The rest of the sub-queries are more or less generic and can mostly be used as-is.
#standardSQL
WITH activities AS (
  SELECT answers.owner_user_id AS id,
    FORMAT_DATE('%Y-%m', DATE(answers.creation_date)) AS period
  FROM `bigquery-public-data.stackoverflow.posts_answers` AS answers
  JOIN `bigquery-public-data.stackoverflow.posts_questions` AS questions
  ON questions.id = answers.parent_id
  WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
  GROUP BY id, period
), cohorts AS (
  SELECT id, MIN(period) AS cohort FROM activities GROUP BY id
), periods AS (
  SELECT period, ROW_NUMBER() OVER(ORDER BY period) AS num
  FROM (SELECT DISTINCT cohort AS period FROM cohorts)
), cohorts_size AS (
  SELECT cohort, periods.num AS num, COUNT(DISTINCT activities.id) AS ids
  FROM cohorts
  JOIN activities ON activities.period = cohorts.cohort AND cohorts.id = activities.id
  JOIN periods ON periods.period = cohorts.cohort
  GROUP BY cohort, num
), retention AS (
  SELECT cohort, activities.period AS period, periods.num AS num, COUNT(DISTINCT cohorts.id) AS ids
  FROM periods
  JOIN activities ON activities.period = periods.period
  JOIN cohorts ON cohorts.id = activities.id
  GROUP BY cohort, period, num
)
SELECT
  CONCAT(cohorts_size.cohort, ' - ', FORMAT("%'d", cohorts_size.ids), ' users') AS cohort,
  retention.num - cohorts_size.num AS period_lag,
  retention.period AS period_label,
  ROUND(retention.ids / cohorts_size.ids * 100, 2) AS retention,
  retention.ids AS rids
FROM retention
JOIN cohorts_size ON cohorts_size.cohort = retention.cohort
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2015-01-01'))
ORDER BY cohort, period_lag, period_label
You can visualize the result of the above query with the tool of your choice.
Note: you can use either period_lag or period_label.
See the difference in their use in the examples below (charts omitted):
with period_lag
with period_label
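To adapt this to the Firebase export from the question, only the activities sub-select needs rewriting. A sketch, assuming the Standard SQL wildcard form of the question's table name and weekly periods (here the cohort is approximated by each user's first active week rather than first_open_timestamp_micros):
#standardSQL
WITH activities AS (
  SELECT
    user_dim.app_info.app_instance_id AS id,
    FORMAT_DATE('%Y-%W', PARSE_DATE('%Y%m%d', _TABLE_SUFFIX)) AS period
  FROM `admob-app-id-xx.xx_IOS.app_events_*`
  GROUP BY id, period
)
-- the cohorts, periods, cohorts_size and retention sub-selects
-- from the query above can then be reused unchanged
SELECT * FROM activities LIMIT 10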

BigQuery - Google Analytics - Time diff between first visit and purchase

Trying to get a list:
visitorid, time of first visit, time of the hit where the transaction occurred.
What I've written is only grabbing rows that have transaction revenue. I am also trying to convert visitStartTime, which is a Unix timestamp, to a regular date via DATE(visitStartTime), but that fails in the GROUP BY because of the outputted date.
Any direction would be super helpful.
SELECT
  fullvisitorID,
  visitNumber,
  visitStartTime,
  hits.transaction.transactionRevenue
FROM
  [75718103.ga_sessions_20150310],
  [75718103.ga_sessions_20150309],
  [75718103.ga_sessions_20150308],
  [75718103.ga_sessions_20150307],
  [75718103.ga_sessions_20150306],
  [75718103.ga_sessions_20150305],
  [75718103.ga_sessions_20150304],
  [75718103.ga_sessions_20150303],
  [75718103.ga_sessions_20150302]
WHERE totals.transactions >= 1
GROUP BY
  fullvisitorID, visitNumber, visitStartTime, hits.transaction.transactionRevenue;
visitStartTime is defined as POSIX time in the Google Analytics schema, which means the number of seconds since the epoch. BigQuery TIMESTAMP is encoded as the number of microseconds since the epoch. Therefore, to get the start time as a TIMESTAMP, I used TIMESTAMP(INTEGER(visitStartTime*1000000)). hits.time contains the number of milliseconds since the first hit, so to get the time of each transaction it needs to be multiplied by 1000 to reach microsecond granularity, hence TIMESTAMP(INTEGER(visitStartTime*1000000 + hits.time*1000)). Since hits is a repeated RECORD, no GROUP BY is necessary; the data model already has all the hits grouped together.
Putting it all together:
SELECT
  fullVisitorId,
  TIMESTAMP(INTEGER(visitStartTime*1000000)) AS start_time,
  TIMESTAMP(INTEGER(visitStartTime*1000000 + hits.time*1000)) AS transaction_time
FROM
  [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.transaction.transactionRevenue > 0
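For reference, the same conversion in Standard SQL could use TIMESTAMP_SECONDS and TIMESTAMP_ADD instead of the manual microsecond arithmetic (a sketch):
#standardSQL
SELECT
  fullVisitorId,
  TIMESTAMP_SECONDS(visitStartTime) AS start_time,
  TIMESTAMP_ADD(TIMESTAMP_SECONDS(visitStartTime), INTERVAL h.time MILLISECOND) AS transaction_time
FROM
  `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`,
  UNNEST(hits) AS h
WHERE h.transaction.transactionRevenue > 0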
Mosha's solution is simple and elegant, but it is too simple: it actually calculates the time between the first pageview and each transaction inside one visit, not the time between the first visit and the first transaction of one visitor. If you calculate the average time using Mosha's query it is 1.33 minutes, but with the query I created it is 9.91 minutes. My SQL skills are quite rusty, so it probably can be improved.
Mosha's query (avg. time between the first pageview and each transaction inside one visit):
SELECT ROUND(AVG(MinutesToTransaction), 2) AS avgMinutesToTransaction FROM (
  SELECT
    fullVisitorId,
    TIMESTAMP(INTEGER(visitStartTime*1000000)) AS start_time,
    TIMESTAMP(INTEGER(visitStartTime*1000000 + hits.time*1000)) AS transaction_time,
    ROUND((TIMESTAMP_TO_SEC(TIMESTAMP(INTEGER(visitStartTime*1000000 + hits.time*1000))) - TIMESTAMP_TO_SEC(TIMESTAMP(INTEGER(visitStartTime*1000000)))) / 60, 2) AS MinutesToTransaction
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
  WHERE hits.transaction.transactionRevenue > 0
)
My query (avg. time between the first visit and the first transaction of one visitor):
SELECT ROUND(AVG(MinutesToTransaction), 2) AS avgMinutesToTransaction FROM (
  SELECT
    firstInteraction.fullVisitorId,
    MIN(firstInteraction.visitId) AS firstVisitId,
    TIMESTAMP(INTEGER(MIN(firstInteraction.visitStartTime)*1000000)) AS timeFirstInteraction,
    firstTransaction.visitId,
    firstTransaction.timeFirstTransaction,
    FIRST(BOOLEAN(firstInteraction.visitId = firstTransaction.visitId)) AS transactionInFirstVisit,
    ROUND((TIMESTAMP_TO_SEC(firstTransaction.timeFirstTransaction) - TIMESTAMP_TO_SEC(TIMESTAMP(INTEGER(MIN(firstInteraction.visitStartTime)*1000000)))) / 60, 2) AS MinutesToTransaction
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] firstInteraction
  INNER JOIN (
    SELECT
      fullVisitorId,
      visitId,
      TIMESTAMP(INTEGER(MIN(visitStartTime*1000000 + hits.time*1000))) AS timeFirstTransaction
    FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
    WHERE hits.type = "TRANSACTION"
    GROUP BY 1, 2
  ) AS firstTransaction
  ON firstInteraction.fullVisitorId = firstTransaction.fullVisitorId
  GROUP BY 1, 4, 5
)
I left some extra fields in, so if you run the inner SELECT on its own you can see some interesting data.
PS: Thanks, Mosha, for showing how to calculate the time.

Cognos: Count the number of occurrences of a distinct id

I'm making a report in Cognos Report Studio and I'm having a bit of trouble getting a count that I need. What I need to do is count the number of IDs for a department, but I need to split the count between initiated and completed. If an ID occurs more than once, it is to be counted as completed; the others, of course, are initiated. So I'm trying to count the number of occurrences of each distinct ID. Here is the query I've made in SQL Developer:
SELECT
  COUNT((CASE WHEN COUNT(S.RFP_ID) > 8 THEN MAX(CT.GCT_STATUS_HISTORY_CLOSE_DT) END)) AS "Sales Admin Completed",
  COUNT((CASE WHEN COUNT(S.RFP_ID) = 8 THEN MIN(CT.GCT_STATUS_HISTORY_OPEN_DT) END)) AS "Sales Admin Initiated"
FROM
  ADM.B_RFP_WC_COVERAGE_DIM S
  JOIN ADM.B_GROUP_CHANGE_REQUEST_DIM CR
    ON S.RFP_ID = CR.GCR_RFP_ID
  JOIN ADM.GROUP_CHANGE_TASK_FACT CT
    ON CR.GROUP_CHANGE_REQUEST_KEY = CT.GROUP_CHANGE_REQUEST_KEY
  JOIN ADM.B_DEPARTMENT_DIM D
    ON D.DEPARTMENT_KEY = CT.DEPARTMENT_RESP_KEY
WHERE CR.GCR_CHANGE_TYPE_ID = '20'
  AND S.RFP_LOB_IND = 'WC'
  AND S.RFP_AUDIT_IND = 'N'
  AND CR.GCR_RECEIVED_DT BETWEEN '01-JAN-13' AND '31-DEC-13'
  AND D.DEPARTMENT_DESC = 'Sales'
  AND CT.GCT_STATUS_IND = 'C'
GROUP BY S.RFP_ID;
Now this works, but I'm not sure how to translate that into Cognos. I tried a CASE that looked like this (this code uses basic names such as dept instead of D.DEPARTMENT_DESC):
CASE WHEN dept = 'Sales' AND count(ID for {DISTINCT ID}) > 1 THEN count(distinct ID) END
I'm using count(distinct ID) instead of count(maximum(close_date)), but the results would be the same anyway. The "AND" is where I think it's being lost; it obviously isn't the proper way to count occurrences. But I'm hoping I'm close. Is there a way to do this with a CASE? Or at all?
--EDIT--
To make my question more clear, here is an example.
Say I have this data in my table:
ID
---
1
2
3
4
2
5
5
6
2
My desired count output would be:
Initiated Completed
--------- ---------
4         2
This is because two of the distinct IDs (2 and 5) occur more than once, so they are counted as Completed. The ones that occur only once are counted as Initiated. I am able to do this in SQL Developer, but I can't figure out how to do it in Cognos Report Studio. I hope this helps to better explain my issue.
Oh, I didn't quite get it originally; amending the answer.
It's still easiest to do with two queries in Report Studio. The key point is that you can use a query as the source for another query, guaranteeing proper group-bys and calculations.
So if you have the ID list in a table in Report Studio, you create:
Query 1 with data items:
ID,
count(*) or count(1) as count_occurrences
status (initiated or completed) with the formula: if (count_occurrences > 1) then ('completed') else ('initiated')
After that you create Query 2, using Query 1 as its source, with just two data items:
[Query1].[Status]
Count with the formula: count([Query1].[ID])
That will give you the result you're after; a plain-SQL sketch of the same logic follows the link below.
Here's a link to doco on how to nest queries:
http://pic.dhe.ibm.com/infocenter/cx/v10r1m0/topic/com.ibm.swg.ba.cognos.ug_cr_rptstd.10.1.0.doc/c_cr_rptstd_wrkdat_working_with_queries_rel.html?path=3_3_10_6#cr_rptstd_wrkdat_working_with_queries_rel
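For reference, here is the two-step logic expressed in plain SQL (a sketch, assuming a single-column table t holding the IDs from the example above):
SELECT
  status,
  COUNT(*) AS id_count
FROM (
  -- Query 1: one row per distinct ID, flagged by how often it occurs
  SELECT
    id,
    CASE WHEN COUNT(*) > 1 THEN 'completed' ELSE 'initiated' END AS status
  FROM t
  GROUP BY id
) per_id
-- Query 2: count the IDs in each status bucket
GROUP BY status
With the sample IDs above this returns initiated = 4 and completed = 2, matching the desired output.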
