Optimize the performance of product scoped query - google-analytics

I would like to build the following table every day to store some aggregate data on page performance of a website. However, each day's worth of data is over 15 million rows.
What steps can I take to improve performance? I intend to save the results as sharded tables, but could I go further, for example by nesting the data within each table to improve performance? What would be the best way to do this?
SELECT
  device.deviceCategory AS device,
  hits_product.productListName AS list_name,
  UPPER(hits_product.productSKU) AS SKU,
  AVG(hits_product.productListPosition) AS avg_plp_position
FROM `mindful-agency-136314.43786551.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
WHERE PARSE_DATE('%y%m%d', _TABLE_SUFFIX)
  BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND hits_product.productListName != "(not set)"
GROUP BY
  device,
  list_name,
  SKU

Since you're using productSKU and productListName as dimensions/groups, there is no way around cross joining with the product array.
Cross joining with product can be dangerous, because sometimes this array is missing and the cross join then removes the whole row; typically you'd use a left join instead. In this case it's fine, though, because you're only interested in product fields.
You should, however, be clear about whether you want to measure list clicks or list impressions, using hits.product.isImpression and hits.product.isClick. At the moment I don't see that distinction. Maybe filter with WHERE hits_product.isImpression for list views?
Instead of shards you might want to consider adding a date field and using PARTITION BY date as well as CLUSTER BY list_name. See the INSERT statement documentation for updating the table and the DDL syntax reference for creating it. This is more performant than shards when it comes to querying the table later.
Starting the table could look something like this:
CREATE TABLE `x.y.z`
PARTITION BY date
CLUSTER BY list_name
AS (
  SELECT
    PARSE_DATE('%Y%m%d', date) AS date,
    device.deviceCategory AS device,
    hits_product.productListName AS list_name,
    UPPER(hits_product.productSKU) AS SKU,
    AVG(IF(hits_product.isClick, hits_product.productListPosition, NULL)) AS avg_plp_click_position,
    AVG(IF(hits_product.isImpression, hits_product.productListPosition, NULL)) AS avg_plp_view_position
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20*` AS t
  CROSS JOIN UNNEST(hits) AS hits
  CROSS JOIN UNNEST(hits.product) AS hits_product
  WHERE PARSE_DATE('%y%m%d', _TABLE_SUFFIX)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
        AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    AND hits_product.productListName != "(not set)"
  GROUP BY
    date,
    device,
    list_name,
    SKU
)
Inserting new records is quite similar; you just need to list the fields upfront, as described in the documentation.
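For example, a minimal sketch of the daily insert against the placeholder table `x.y.z` created above:
INSERT INTO `x.y.z` (date, device, list_name, SKU, avg_plp_click_position, avg_plp_view_position)
SELECT
  PARSE_DATE('%Y%m%d', date) AS date,
  device.deviceCategory AS device,
  hits_product.productListName AS list_name,
  UPPER(hits_product.productSKU) AS SKU,
  AVG(IF(hits_product.isClick, hits_product.productListPosition, NULL)),
  AVG(IF(hits_product.isImpression, hits_product.productListPosition, NULL))
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
-- only yesterday's shard:
WHERE PARSE_DATE('%y%m%d', _TABLE_SUFFIX) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND hits_product.productListName != "(not set)"
GROUP BY date, device, list_name, SKU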

Related

Daily schedule in BigQuery using data from Firebase analytics

So I have created a daily schedule in BigQuery using the "Append to table" preference, so every day it adds yesterday's data to my specified table. I have scheduled this query to run every day at 9 AM, but the issue is that sometimes Firebase creates the previous day's data table in BigQuery later than 9 AM.
An example of the daily scheduled SELECT I would be using is:
SELECT * FROM `analytics.events_*` WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
What would be the best practice to schedule a daily update for the previous day in BigQuery from Firebase, so there are no times where I am missing days?
BigQuery schedules are set to run at fixed times. If your incoming data varies in delivery time, then BigQuery schedules are not what you're looking for.
But if you insist on using BigQuery schedules, you could relax the WHERE condition and catch "missing" days the next time the schedule runs. That flips your problem: you now need to handle the case of not appending already-appended rows (which also increases query cost):
SELECT *
FROM `analytics.events_*`
LEFT JOIN `[target dataset].[target table]` AS T
USING (event_name, event_timestamp, user_pseudo_id)
WHERE T.event_name IS NULL
AND T.event_timestamp IS NULL
AND T.user_pseudo_id IS NULL
AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
Alternatively, you could turn the query into an INSERT statement, inserting records and handling duplicates in the same way:
INSERT `[target dataset].[target table]`
SELECT *
FROM `analytics.events_*`
LEFT JOIN `[target dataset].[target table]` AS T
USING (event_name, event_timestamp, user_pseudo_id)
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
AND T.event_name IS NULL
AND T.event_timestamp IS NULL
AND T.user_pseudo_id IS NULL
Then you wouldn't need to configure a destination table for the schedule.
Furthermore, if your target table is timestamp-partitioned, you can reduce the amount of data scanned by limiting the scan of the target table to a single date instead of the entire table. Note that this date filter belongs on the target table before the join (e.g. in a subselect), because adding it to the outer WHERE alongside the IS NULL anti-join conditions would filter out every row (this also assumes event_timestamp is stored as a TIMESTAMP):
...
LEFT JOIN (
  SELECT * FROM `[target dataset].[target table]`
  WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
) AS T
...

BigQuery Firebase export working differently when using wildcard character and _TABLE_SUFFIX compared to without using it

My requirement:
To append unnested data in a separate table and use it for visualization and analytics.
Implementing it:
As I am not sure exactly when events_intraday_YYYYMMDD syncs into events_YYYYMMDD (see the export documentation for reference):
0- Created an events_normalized table once at the start (it is done once, not daily) by using:
CREATE TABLE analytics_data_export.events_normalized AS
SELECT ...
FROM `analytics_xxxxxx.events_*`
to collect all the data from events_YYYYMMDD.
1- Creating/replacing a daily temp table with:
CREATE OR REPLACE TABLE analytics_data_export.daily_data_temp AS
SELECT ...
FROM `analytics_xxxxxx.events_*`
WHERE _TABLE_SUFFIX BETWEEN
  FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY)) AND
  FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
as I have seen multiple days' data syncing together, so to be on the safe side I am using the last 1-4 days of data.
2- Deleting the inner join of both tables (daily_data_temp, events_normalized) from events_normalized to remove any duplicates it might have. For example, if events_normalized has data up to the 18th but daily_data_temp has data from the 16th to the 19th, all overlapping rows up to the 18th are removed from events_normalized.
3- Reinserting daily_data_temp into events_normalized.
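A sketch of steps 2-3, assuming the standard Firebase export column event_date identifies the overlapping days:
-- remove overlapping days from the big table...
DELETE FROM analytics_data_export.events_normalized
WHERE event_date IN (
  SELECT DISTINCT event_date FROM analytics_data_export.daily_data_temp
);
-- ...then re-append the fresh rows
INSERT INTO analytics_data_export.events_normalized
SELECT * FROM analytics_data_export.daily_data_temp;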
Questions:
1- Is there any optimized way of implementing these requirements?
2- In step 0, while creating the events_normalized table, if I use:
WHERE
_TABLE_SUFFIX <=
FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 0 DAY))
I get different results compared to when I use:
CREATE TABLE analytics_data_export.events_normalized AS
SELECT ...
FROM `analytics_xxxxxx.events_*`
The difference is that the latter one includes the current date's data as well, whereas in events_YYYYMMDD I can only see data up to yesterday. I don't understand this behavior.
For example, if the current day is the 20th of July, in events_YYYYMMDD I can see tables only up to events_20200719.
To optimize, you can follow the steps below:
Create a hash out of event_timestamp and the other unique fields, and use this to filter the data.
Instead of deleting duplicate rows from the larger initial table, delete them from the small temp table and then insert the remaining rows.
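A rough sketch of both suggestions combined; FARM_FINGERPRINT is one hashing option, and the set of fields assumed to identify a row (event_timestamp, event_name, user_pseudo_id) should be adjusted to your schema:
-- delete rows from the small temp table whose hash already exists in the big table
DELETE FROM analytics_data_export.daily_data_temp
WHERE FARM_FINGERPRINT(CONCAT(CAST(event_timestamp AS STRING), event_name, user_pseudo_id)) IN (
  SELECT FARM_FINGERPRINT(CONCAT(CAST(event_timestamp AS STRING), event_name, user_pseudo_id))
  FROM analytics_data_export.events_normalized
);
-- then append what is left
INSERT INTO analytics_data_export.events_normalized
SELECT * FROM analytics_data_export.daily_data_temp;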
It's because the filter `analytics_xxxxxx.events_*` will match both the per-day events tables and the intraday event tables, which are named like events_intraday_20200721.
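If you only want the finalized daily tables, a sketch of one way to exclude the intraday tables from the wildcard match:
SELECT COUNT(*) AS events
FROM `analytics_xxxxxx.events_*`
-- intraday tables surface here with suffixes like "intraday_20200721":
WHERE _TABLE_SUFFIX NOT LIKE 'intraday%'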

How to calculate quantity using BigQuery in GA

I am trying to calculate the number of transactions, quantity, and revenue for the last 14 days.
What I've got so far is
SELECT
sum(totals.transactions) AS Transaction ,
sum(hits.product.productQuantity) AS quantity,
sum(totals.transactionRevenue)/1000000 AS Revenue
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2019-10-01'), TIMESTAMP('2019-10-14'));
I get the same transactions and revenue as in my custom report, but somehow I get a different number for quantity.
Am I doing something wrong or using the wrong table?
I am supposed to get 63 for the quantity, but I get 2420 when I run the BigQuery query above.
Thanks in advance!
The query in the question produces no results. Thanks for choosing a public dataset source (so I could run the query), but there are no tables within that time range.
The same query over an existing time period:
SELECT
sum(totals.transactions) AS Transaction ,
sum(hits.product.productQuantity) AS quantity,
sum(totals.transactionRevenue)/1000000 AS Revenue
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-07-01'), TIMESTAMP('2017-07-12'));
Transaction: 317, quantity: 27804, Revenue: 33020.66
Then I rewrote it as a #standardSQL query - to see if there is some implicit flattening that creates incorrect results. In #standardSQL you have to do explicit flattening, which I did like this:
SELECT Transaction
, (SELECT SUM((SELECT SUM(productQuantity) FROM UNNEST(product))) FROM UNNEST(hitsarray)) AS quantity
, Revenue
FROM (
SELECT SUM(totals.transactions) AS Transaction
, SUM(totals.transactionRevenue)/1000000 AS Revenue
, ARRAY_CONCAT_AGG(hits) hitsarray
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _table_suffix BETWEEN '20170701' AND '20170712'
)
Transaction: 317, quantity: 27804, Revenue: 33020.66
And you can see that it gives me the same results. Is the ARRAY_CONCAT_AGG() plus SUM((SELECT SUM(...))) pattern the correct way to unnest the hits and product data within? Well, that depends on why you were expecting a different value; please make that clear in the question.
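If the expected 63 was GA's Quantity metric, one likely cause is that the export contains product rows for impressions, detail views and other actions, not just purchases. A hedged guess, restricting to purchase hits (action_type '6' is "Completed purchase" in the export schema):
SELECT SUM(hits_product.productQuantity) AS quantity
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170712'
  -- only count products attached to completed-purchase hits:
  AND hits.eCommerceAction.action_type = '6'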

Does BigQuery include intraday tables when I query over all dates up to current date?

Google Analytics 360 data in BigQuery has two intraday tables for the past two days, and permanent partitioned tables for the dates before that. When I run a query on the ga_sessions_ tables for the past 30 days, does this automatically include the two days' data in the ga_sessions_intraday_ tables or do I have to include them specifically?
Edit: here is a query that illustrates this:
SELECT date, visitId, totals.transactions
FROM `dataset.ga_sessions_2018*`
WHERE _TABLE_SUFFIX BETWEEN "0401"
  AND CAST(CURRENT_DATE() AS STRING)
ORDER BY date DESC
The result is that the most recent date is two days ago (i.e., not including intraday tables). That's my question answered, I guess; thanks anyway.
You can query across whatever tables you want; just write a filter that matches the right suffixes. For example,
SELECT date, visitId, totals.transactions, _TABLE_SUFFIX AS suffix
FROM `dataset.ga_sessions_*`
WHERE REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+')
  BETWEEN "20180401" AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
ORDER BY date DESC
I put the suffix in the select list so you can tell which table is matched.

Sessions by hits.page.pagePath in GA bigquery tables

I am new to bigquery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of sessionids where path like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions on this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
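For reference, a standard SQL sketch of the same per-session flags, using EXISTS over the hits array in place of SOME(...) WITHIN RECORD:
SELECT
  visitId,
  EXISTS(SELECT 1 FROM UNNEST(hits) AS h WHERE h.page.pagePath LIKE '/helmets%') AS has_helmets,
  EXISTS(SELECT 1 FROM UNNEST(hits) AS h WHERE h.page.pagePath LIKE '/vests%') AS has_vests
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`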
Way 1: easier, but you need to repeat it for each field
Obviously you can do something like this:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then run multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc.).
Way 2: works fine, but not with repeated fields
If you want ONE query that will just group by the first part of the string, you can do something like this:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]) group by a
When I do this, it returns the 63 sessions, with a total count of 63 :).
Way 3: using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeatable, you would need a FLATTEN in your query:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason why you need FLATTEN here is that the "hits" field is repeated. If you don't flatten, the query won't look into ALL the "hits" in your response. Adding FLATTEN makes you work off a sub-table where each hit is in its own row, so you can query all of them.
If you want it by session instead of by hit (it'll be both), do something like this:
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits)) group by b, a
