BigQuery + Google Analytics: Calculating quantities purchased by SKU. UNNEST not working

I'm trying to calculate the total quantities purchased for individual SKUs between certain dates. The final output should be date / SKU / Qty_sold.
My dataset is the Google Analytics sample public dataset.
Main issue: When I try to run the below query using item.itemQuantity, I get the below error:
Syntax error: Unexpected keyword UNNEST at [6:1]
If you see the screenshot for item.itemQuantity, it seems to be nested. By adding the UNNEST function, it's supposed to flatten the table and get the count. This is my understanding of UNNEST. However, when I apply UNNEST, the query doesn't run.
Second issue: When I check the BQ GA schema, the definitions for hits.item.itemQuantity and hits.product.productQuantity seem to be the same. I'm unable to differentiate between the two fields, and I don't know which one I should use in my query.
https://support.google.com/analytics/answer/3437719?hl=en
hits.product.productQuantity INTEGER The quantity of the product purchased.
hits.item.itemQuantity INTEGER The quantity of the product sold.
Can anyone please explain how I can improve this query to get my desired result? Thanks.
SELECT
date,
hits.item.productSKU AS SKU,
SUM(hits.item.itemQuantity) AS qty_sold
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
UNNEST (hits) hit
WHERE _TABLE_SUFFIX
BETWEEN
'20160801' AND '20160802'

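The syntax error occurs because UNNEST has to be joined to the table with a comma (or CROSS JOIN), and the aggregate also needs a GROUP BY. Try the query below for hits.product: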
SELECT
date,
prod.productSKU AS SKU,
SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit, UNNEST(product) prod
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160802'
GROUP BY date, SKU
or the one below for hits.item:
SELECT
date,
hit.item.productSKU AS SKU,
SUM(hit.item.itemQuantity) AS qty_sold
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160802'
GROUP BY date, SKU
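To see what the comma join with UNNEST does to the rows, here is a minimal Python sketch (not BigQuery code) that mimics the hits.product query on made-up data. The session dictionaries, SKUs and quantities are invented for illustration; only the field names follow the GA export schema.

```python
# Hypothetical sessions shaped like rows of ga_sessions_*: each row has a
# repeated `hits` field, and each hit has a repeated `product` field.
sessions = [
    {"date": "20160801", "hits": [
        {"product": [{"productSKU": "A1", "productQuantity": 2},
                     {"productSKU": "B2", "productQuantity": None}]},
        {"product": [{"productSKU": "A1", "productQuantity": 1}]},
    ]},
    {"date": "20160802", "hits": [
        {"product": [{"productSKU": "B2", "productQuantity": 5}]},
        {"product": []},  # a hit without products contributes no rows
    ]},
]


def qty_by_date_and_sku(rows):
    """Mimic: FROM t, UNNEST(hits) hit, UNNEST(product) prod GROUP BY date, SKU."""
    totals = {}
    for row in rows:                      # one session row
        for hit in row["hits"]:           # UNNEST(hits) hit
            for prod in hit["product"]:   # UNNEST(product) prod
                key = (row["date"], prod["productSKU"])
                totals.setdefault(key, None)
                qty = prod["productQuantity"]
                if qty is not None:       # SUM skips NULLs; all-NULL stays NULL
                    totals[key] = (totals[key] or 0) + qty
    return totals


result = qty_by_date_and_sku(sessions)
for (day, sku), qty in sorted(result.items()):
    print(day, sku, qty)
```

Each product inside each hit becomes its own row before the grouping happens, which is why the table and both UNNEST calls have to be joined rather than simply listed one after another.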

Related

Aggregating on groups of data order by date in Snowflake

I have the following data in my table:
I need the output to be the following in Snowflake:
Basically, the rows should be ordered by transaction date, and for each consecutive run of transactions in the same country and city I need the first transaction, the last transaction and the count of transactions. I tried using window functions but I'm not getting the desired result. The tricky part is that the grouping has to follow the sequence: you can see TEXAS and CALIFORNIA repeating, depending on the order of the transactions for the country and city.
Ideally this would be done in a query. Second best would be some other way of computing it that is fast. It has to be done on batches of data. I don't really want an approach where the data is pulled in order and then processed row by row, unless that is the only option. I'm open to advice on that as well. Thanks!
Hint: GROUP BY, MIN, MAX, COUNT
I was able to work out the logic, and the following query works:
select countryid, regionid, min(requesttime), max(requesttime), count(*) from (select deviceid,countryid,regionid,cityid, requesttime,
row_number() over (partition by countryid order by requesttime) as seqnum_1,
row_number() over (partition by countryid, regionid order by requesttime) as seqnum_2
from table t order by requesttime
) t group by countryid, regionid, (seqnum_1 - seqnum_2) order by min(requesttime);
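The difference of the two row numbers stays constant within a consecutive run and changes when the run is interrupted, which is what makes it usable as a grouping key. A small Python sketch of the same idea, on invented rows (the tuples and values below are assumptions, not the asker's data):

```python
from collections import defaultdict

# Hypothetical rows: (countryid, regionid, requesttime).
rows = [
    ("US", "TEXAS", 1), ("US", "TEXAS", 2), ("US", "CALIFORNIA", 3),
    ("US", "TEXAS", 4), ("US", "CALIFORNIA", 5), ("US", "CALIFORNIA", 6),
]


def islands(rows):
    """Group consecutive rows per (country, region) using seqnum_1 - seqnum_2."""
    seq_country = defaultdict(int)   # row_number() partitioned by country
    seq_region = defaultdict(int)    # row_number() partitioned by country, region
    groups = {}
    for country, region, t in sorted(rows, key=lambda r: r[2]):
        seq_country[country] += 1
        seq_region[(country, region)] += 1
        diff = seq_country[country] - seq_region[(country, region)]
        groups.setdefault((country, region, diff), []).append(t)
    result = [(c, r, min(ts), max(ts), len(ts)) for (c, r, _), ts in groups.items()]
    return sorted(result, key=lambda g: g[2])  # order by first request time


for island in islands(rows):
    print(island)
```

With these rows TEXAS appears twice in the output, once for each uninterrupted run, which matches the repeating pattern described in the question.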

how to calculate quantity using BigQuery in GA

I am trying to calculate the number of transactions, the quantity and the revenue for the last 14 days.
What I've got so far is
SELECT
sum(totals.transactions) AS Transaction ,
sum(hits.product.productQuantity) AS quantity,
sum(totals.transactionRevenue)/1000000 AS Revenue
FROM TABLE_DATE_RANGE([bigquery-public-data.google_analytics_sample.ga_sessions_],
TIMESTAMP('2019-10-01'), TIMESTAMP('2019-10-14'));
The transactions and revenue match what I get from a custom report, but somehow I get a different number for quantity.
Am I doing something wrong, or using the wrong table?
I expected to get 63 for the quantity, but I get 2420 when I run the query above.
Thanks in advance!
The query in the question produces no results. Thanks for choosing a public dataset source (so I could run the query), but there are no tables within that time range.
The same query over an existing time period:
SELECT
sum(totals.transactions) AS Transaction ,
sum(hits.product.productQuantity) AS quantity,
sum(totals.transactionRevenue)/1000000 AS Revenue
FROM TABLE_DATE_RANGE([bigquery-public-data.google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-07-01'), TIMESTAMP('2017-07-12'));
317 27804 33020.66
Then I rewrote it as a #standardSQL query - to see if there is some implicit flattening that creates incorrect results. In #standardSQL you have to do explicit flattening, which I did like this:
SELECT Transaction
, (SELECT SUM((SELECT SUM(productQuantity) FROM UNNEST(product))) FROM UNNEST(hitsarray)) AS quantity
, Revenue
FROM (
SELECT SUM(totals.transactions) AS Transaction
, SUM(totals.transactionRevenue)/1000000 AS Revenue
, ARRAY_CONCAT_AGG(hits) hitsarray
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _table_suffix BETWEEN '20170701' AND '20170712'
)
317 27804 33020.66
And you can see that it gives me the same results. Is the ARRAY_CONCAT_AGG() + SUM((SELECT SUM()) the correct way to unnest the hits and product data within? Well, depends on why you were expecting a different value. Please make that clear in the question.

Does BigQuery include intraday tables when I query over all dates up to current date?

Google Analytics 360 data in BigQuery has two intraday tables for the past two days, and permanent partitioned tables for the dates before that. When I run a query on the ga_sessions_ tables for the past 30 days, does this automatically include the two days' data in the ga_sessions_intraday_ tables or do I have to include them specifically?
Edit; here is a query that illustrates this:
SELECT date, visitId, totals.transactions
FROM `dataset.ga_sessions_2018*`
WHERE
_TABLE_SUFFIX BETWEEN "0401"
AND CAST(CURRENT_DATE() as STRING)
ORDER BY date DESC
The result is that the most recent date is two days ago (i.e. the intraday tables are not included). That answers my question I guess, thanks anyway.
You can query across whatever tables you want; just write a filter that matches the right suffixes. For example,
SELECT date, visitId, totals.transactions, _TABLE_SUFFIX AS suffix
FROM `dataset.ga_sessions_*` WHERE REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+')
BETWEEN "20180401" AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
ORDER BY date DESC
I put the suffix in the select list so you can tell which table is matched.
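As a rough illustration of why the regular expression picks up both kinds of table, here is a Python sketch of the same filter applied to a list of suffixes. The suffix values and the fixed "today" are made up; the assumption is that the intraday tables are named ga_sessions_intraday_YYYYMMDD, so their wildcard suffix starts with "intraday_".

```python
import re
from datetime import date


def matching_suffixes(suffixes, start, end):
    """Keep suffixes whose first run of digits lies between start and end."""
    kept = []
    for suffix in suffixes:
        match = re.search(r"[0-9]+", suffix)   # like REGEXP_EXTRACT(..., r'[0-9]+')
        if match and start <= match.group(0) <= end:
            kept.append(suffix)
    return kept


# Hypothetical suffixes as the wildcard `dataset.ga_sessions_*` might expose them.
suffixes = ["20180331", "20180401", "20180429",
            "intraday_20180430", "intraday_20180501"]
today = date(2018, 5, 1).strftime("%Y%m%d")    # stand-in for FORMAT_DATE on CURRENT_DATE()
selected = matching_suffixes(suffixes, "20180401", today)
print(selected)
```

Comparing the extracted digits as strings works here because YYYYMMDD values sort the same way lexically and chronologically.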

Results of joined queries in BQ don't match data in Google Analytics

Background
In BigQuery, I'm trying to find the number of visitors that both visit one of two pages and purchase a specific product.
When I run each of the sub-queries, the numbers match exactly what I see in Google Analytics.
However, when I join them, the number is different than what I see in GA. I've had someone bring the results of the two sub-queries into Excel and do the equivalent, and their results equal what I'm seeing in BQ.
Details
Here's the query:
SELECT
ProductSessions.date AS date,
SUM(ProductTransactions.totalTransactions) transactions,
COUNT(ProductSessions.visitId) visited_product_sessions
FROM (
SELECT
visitId, date
FROM
`103554833.ga_sessions_20170219`
WHERE
EXISTS(
SELECT 1 FROM UNNEST(hits) h
WHERE REGEXP_CONTAINS(h.page.pagePath, r"^www.domain.com/(product|product2).html.*"))
GROUP BY visitID, date)
AS ProductSessions
LEFT JOIN (
SELECT
totals.transactions as totalTransactions,
visitId,
date
FROM
`103554833.ga_sessions_20170219`
WHERE
totals.transactions IS NOT NULL
AND EXISTS(
SELECT 1
FROM
UNNEST(hits) h,
UNNEST(h.product) prod
WHERE REGEXP_CONTAINS(prod.v2ProductName, r"^Product®$"))
GROUP BY
visitId, totals.transactions,
date) AS ProductTransactions
ON
ProductTransactions.visitId = ProductSessions.visitId
WHERE ProductTransactions.visitId is not null
GROUP BY
date
ORDER BY
date ASC
I'm expecting ProductTransactions.totalTransactions to replicate the number of transactions in Google Analytics when filtered with an advanced segment of both:
Sessions include Page matching RegEx: www.domain.com/(product|product2).html.*
Sessions include Product matches exactly: Product®
However, the results in BQ are about 20% higher than in GA.
Why the difference?

Sessions by hits.page.pagePath in GA bigquery tables

I am new to bigquery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of sessionids where path like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions have a hits.page.pagePath like /vests% or /helmets%? How many were vests only vs. helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
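The SOME(...) WITHIN RECORD expressions ask, per session, whether at least one hit matches. A minimal Python sketch of that classification, using invented sessions and page paths rather than the LondonCycleHelmet data:

```python
from collections import Counter

# Hypothetical sessions, each with the page paths of its hits.
sessions = [
    {"visitId": 1, "paths": ["/helmets/road", "/vests/orange"]},
    {"visitId": 2, "paths": ["/helmets/kids"]},
    {"visitId": 3, "paths": ["/vests/yellow", "/basket"]},
    {"visitId": 4, "paths": ["/", "/login"]},
]


def classify(session):
    """Return which of the four buckets a session falls into."""
    # any(...) plays the role of SOME(...) WITHIN RECORD;
    # startswith mirrors LIKE '/helmets%' and LIKE '/vests%'.
    has_helmets = any(p.startswith("/helmets") for p in session["paths"])
    has_vests = any(p.startswith("/vests") for p in session["paths"])
    if has_helmets and has_vests:
        return "both_helmets_and_vests"
    if has_helmets:
        return "helmets_only"
    if has_vests:
        return "vests_only"
    return "neither_helmets_nor_vests"


counts = Counter(classify(s) for s in sessions)
print(counts)
```

Because every session lands in exactly one bucket, the four counts add up to the total number of sessions.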
Way 1, easier but you need to repeat on each field
Obviously you can do something like this :
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then have multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc).
Way 2, works fine, but not with repeated fields
If you want ONE query that'll just group by on the first part of the string, you can do something like that :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I do this, it returns the 63 sessions, with a total count of 63 :).
Way 3, using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeatable, you would need a FLATTEN in your query :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason why you need to FLATTEN here is that the "hits" field is repeatable. If you don't flatten, it won't look into ALL the "hits" in your response. Adding "FLATTEN" will make you work off a sub-table where each hit is in its own row, so you can query on all of them.
If you want it broken down by session as well as by path (the result is grouped by both), do something like:
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b, FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a
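To make the difference between counting hits and counting sessions concrete, here is a small Python sketch of what flattening does. The sessions and paths are invented, and taking the first non-empty path segment as the grouping key is an assumption about what the queries above intend.

```python
from collections import Counter

# Hypothetical sessions with the page paths of their hits.
sessions = [
    {"visitId": 1, "paths": ["/helmets/road", "/helmets/kids", "/vests/orange"]},
    {"visitId": 2, "paths": ["/helmets/kids"]},
]


def flatten(sessions):
    """Yield one (visitId, path) row per hit, like FLATTEN(table, hits)."""
    for session in sessions:
        for path in session["paths"]:
            yield session["visitId"], path


def section(path):
    """First non-empty segment of the path, e.g. '/helmets/road' -> 'helmets'."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else ""


# Counting flattened rows counts hits.
hits_per_section = Counter(section(path) for _, path in flatten(sessions))

# Deduplicating (visitId, section) pairs first counts sessions instead.
pairs = {(visit_id, section(path)) for visit_id, path in flatten(sessions)}
sessions_per_section = Counter(sec for _, sec in pairs)

print(hits_per_section)
print(sessions_per_section)
```

The session totals can still add up to more than the number of sessions, since one session may touch several sections.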
