I am trying to calculate the number of transactions, quantity, and revenue for the last 14 days.
What I've got so far is:
SELECT
  SUM(totals.transactions) AS Transaction,
  SUM(hits.product.productQuantity) AS quantity,
  SUM(totals.transactionRevenue)/1000000 AS Revenue
FROM TABLE_DATE_RANGE([bigquery-public-data.google_analytics_sample.ga_sessions_],
  TIMESTAMP('2019-10-01'), TIMESTAMP('2019-10-14'));
I get the same transaction and revenue numbers as in my custom report, but somehow I get a different number for quantity.
Am I doing something wrong, or using the wrong table?
I'm supposed to get 63 for the quantity, but I get 2420 when I run the query above.
Thanks in advance!
The query in the question produces no results. Thanks for choosing a public dataset source (so I could run the query), but there are no tables within that date range.
The same query over an existing time period:
SELECT
  SUM(totals.transactions) AS Transaction,
  SUM(hits.product.productQuantity) AS quantity,
  SUM(totals.transactionRevenue)/1000000 AS Revenue
FROM TABLE_DATE_RANGE([bigquery-public-data.google_analytics_sample.ga_sessions_],
  TIMESTAMP('2017-07-01'), TIMESTAMP('2017-07-12'));
Transaction  quantity  Revenue
317          27804     33020.66
Then I rewrote it as a #standardSQL query, to see if there is some implicit flattening that creates incorrect results. In #standardSQL you have to do the flattening explicitly, which I did like this:
SELECT Transaction
, (SELECT SUM((SELECT SUM(productQuantity) FROM UNNEST(product))) FROM UNNEST(hitsarray)) AS quantity
, Revenue
FROM (
SELECT SUM(totals.transactions) AS Transaction
, SUM(totals.transactionRevenue)/1000000 AS Revenue
, ARRAY_CONCAT_AGG(hits) hitsarray
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _table_suffix BETWEEN '20170701' AND '20170712'
)
Transaction  quantity  Revenue
317          27804     33020.66
And you can see that it gives the same results. Is ARRAY_CONCAT_AGG() plus the nested SUM((SELECT SUM() ...)) the correct way to unnest the hits and the product data within them? Well, that depends on why you were expecting a different value. Please make that clear in the question.
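As one more cross-check on the quantity figure, the same flattening can be written with plain comma (cross) joins instead of ARRAY_CONCAT_AGG(); this is just a sketch over the same date range, and if the two flattenings are equivalent it should return the same quantity:

```sql
-- Unnest each session's hits, then each hit's product array,
-- and sum productQuantity over all resulting rows.
SELECT
  SUM(p.productQuantity) AS quantity
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE _table_suffix BETWEEN '20170701' AND '20170712'
```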
Related
I am trying to calculate the total time spent by users on my app. We have integrated Firebase Analytics data into BigQuery. Can I use the sum of the values of engagement_time_msec/1000 in the SELECT statement of my query? This is what I am trying:
SELECT SUM(x.value.int_value) FROM "[dataset]", UNNEST(event_params) AS x WHERE x.key = "engagement_time_msec"
I am getting very big values after executing this query (it gives a huge number of hours per day). I am not sure whether it is OK to use the sum of engagement_time_msec for calculating the total time spent by users on the app.
I was not expecting users to spend this much time on the app. Is this the right way to calculate engagement time, or is there a better event to calculate it from?
Any help would be highly appreciated.
As per the Google Analytics docs, engagement_time_msec is defined as "The additional engagement time (ms) since the last user_engagement event". Therefore, if you only look at this field, you lose all the time users spent before the user_engagement event was triggered.
What I'd do, since ga_session_id is now defined, is grab the maximum and minimum event timestamp for each ga_session_id, apply the TIMESTAMP_DIFF() function to each pair, and sum the results over all the sessions of a given day:
WITH ga_sessions AS (
SELECT
event_timestamp,
event_date,
params.value.int_value AS ga_session_id
FROM
`analytics_123456789.events_*`, UNNEST(event_params) AS params
WHERE
params.key = "ga_session_id"
),
session_length AS (
SELECT
event_date,
ga_session_id,
TIMESTAMP_DIFF(MAX(TIMESTAMP_MICROS(event_timestamp)), MIN(TIMESTAMP_MICROS(event_timestamp)), SECOND) AS session_duration_seconds
FROM
ga_sessions
WHERE
ga_session_id IS NOT NULL
GROUP BY
1, 2
),
final AS (
SELECT
event_date,
SUM(session_duration_seconds) as total_seconds_in_app
FROM
session_length
GROUP BY
1
ORDER BY
1 DESC
)
SELECT * FROM final
OUTPUT (data extracted from the app I work at):
event_date | total_seconds_in_app
-----------+--------------------
20210920 | 45600
20210919 | 43576
20210918 | 44539
I have the following data in my table:
I need the output to be the following in Snowflake:
It is basically: order by transaction date, get the first and the last transaction for each country and city, and the count of transactions as they occur in sequence. I tried using window functions but I'm not getting the desired result. The tricky part, as you can see, is that the grouping has to be done in sequence. You can see TEXAS and CALIFORNIA repeating depending on the sequence of transactions for the country and city.
Best would be a query. Second best, some other way of computing it that is fast; it has to run on batches of data. I don't really want an approach where the data is pulled in order and then processed row by row unless that is the only option. Open to advice on that as well. Thanks!
Hint: GROUP BY, MIN, MAX, COUNT
I was able to find the logic, and the following query works:
SELECT countryid, regionid, MIN(requesttime), MAX(requesttime), COUNT(*)
FROM (
  SELECT deviceid, countryid, regionid, cityid, requesttime,
    ROW_NUMBER() OVER (PARTITION BY countryid ORDER BY requesttime) AS seqnum_1,
    ROW_NUMBER() OVER (PARTITION BY countryid, regionid ORDER BY requesttime) AS seqnum_2
  FROM table t
) t
GROUP BY countryid, regionid, (seqnum_1 - seqnum_2)
ORDER BY MIN(requesttime);
I'm trying to calculate the total quantities purchased for individual SKUs between certain dates. The final output should be date / SKU / qty_sold.
My dataset is the Google Analytics sample public dataset.
Main issue: When I try to run the below query using item.itemQuantity, I get the following error:
Syntax error: Unexpected keyword UNNEST at [6:1]
If you look at the screenshot for item.itemQuantity, the field is nested. My understanding of UNNEST is that adding it is supposed to flatten the table so the count can be taken. However, when I apply UNNEST, the query doesn't run.
Second issue: When I check the BigQuery GA schema, the definitions of hits.item.itemQuantity and hits.product.productQuantity seem to be the same, and I'm unable to differentiate between the two fields or decide which one to use in my query.
https://support.google.com/analytics/answer/3437719?hl=en
hits.product.productQuantity INTEGER The quantity of the product purchased.
hits.item.itemQuantity INTEGER The quantity of the product sold.
Can anyone please explain how I can improve this query to get my desired result? Thanks.
SELECT
date,
hits.item.productSKU AS SKU,
SUM(hits.item.itemQuantity) AS qty_sold
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
UNNEST (hits) hit
WHERE _TABLE_SUFFIX
BETWEEN
'20160801' AND '20160802'
Try below for hits.product
SELECT
date,
prod.productSKU AS SKU,
SUM(prod.productQuantity) AS qty_purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160802'
GROUP BY date, SKU
or below for hits.item
SELECT
date,
hit.item.productSKU AS SKU,
SUM(hit.item.itemQuantity) AS qty_sold
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST (hits) hit
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20160802'
GROUP BY date, SKU
I would like to build the following table every day, to store some aggregate data on the page performance of a website. However, each day's worth of data is over 15 million rows.
What steps can I take to improve performance? I intend to save the data as sharded tables, but could I improve things further, for example by nesting the data within each table? What would be the best way to do this?
SELECT
  device.devicecategory AS device,
  hits_product.productListName AS list_name,
  UPPER(hits_product.productSKU) AS SKU,
  AVG(hits_product.productListPosition) AS avg_plp_position
FROM `mindful-agency-136314.43786551.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
WHERE PARSE_DATE('%y%m%d', _table_suffix) BETWEEN
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND hits_product.productListName != "(not set)"
GROUP BY
  device,
  list_name,
  SKU
Since you're using productSKU and productListName as dimensions/groups, there is no way around cross joining with the product array.
Cross joining with product can be dangerous, because sometimes this array is missing and you lose the whole row; typically you'd use a left join instead. But in this case it's fine, because you're only interested in product fields.
You should, however, be clear about whether you want to see list clicks or list impressions, using hits.product.isImpression and hits.product.isClick. At the moment I don't see a distinction there. Maybe filter with WHERE hits_product.isImpression in the case of list views?
Instead of shards, you might want to consider adding a date field and using PARTITION BY date as well as CLUSTER BY list_name. See the INSERT statement for updating the table and the DDL syntax for creating it. This is more performant than shards when it comes to querying the table later.
Starting the table could look something like this:
CREATE TABLE `x.y.z`
PARTITION BY date
CLUSTER BY list_name
AS (
SELECT
PARSE_DATE('%Y%m%d',date) AS date,
device.devicecategory AS device,
hits_product.productListName AS list_name,
UPPER(hits_product.productSKU) AS SKU,
AVG(IF(hits_product.isClick, hits_product.productListPosition, NULL)) AS avg_plp_click_position,
AVG(IF(hits_product.isImpression, hits_product.productListPosition, NULL)) AS avg_plp_view_position
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
WHERE
PARSE_DATE('%y%m%d', _table_suffix)
BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND hits_product.productListName != "(not set)"
GROUP BY
date,
device,
list_name,
SKU
)
Inserting new records is quite similar; you just need to list the fields up front, as described in the documentation.
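A sketch of such a daily insert, assuming the `x.y.z` table created above (the column list here is illustrative and must match your actual schema):

```sql
INSERT INTO `x.y.z` (date, device, list_name, SKU, avg_plp_click_position, avg_plp_view_position)
SELECT
  PARSE_DATE('%Y%m%d', date) AS date,
  device.devicecategory AS device,
  hits_product.productListName AS list_name,
  UPPER(hits_product.productSKU) AS SKU,
  AVG(IF(hits_product.isClick, hits_product.productListPosition, NULL)) AS avg_plp_click_position,
  AVG(IF(hits_product.isImpression, hits_product.productListPosition, NULL)) AS avg_plp_view_position
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(hits.product) AS hits_product
WHERE PARSE_DATE('%y%m%d', _table_suffix) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND hits_product.productListName != "(not set)"
GROUP BY date, device, list_name, SKU
```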
I am new to BigQuery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles, so the sum would be greater than the total number of sessions. Essentially, I want to create a 'session id' and do a count distinct of session ids where the path is like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visitId and fullVisitorId to create a unique visit id, but apparently that is quite different from sessions. Can someone enlighten me? Thanks!
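For reference, a common convention (not an official field) for a unique session id in the GA export is to concatenate fullVisitorId and visitId; a #standardSQL sketch over the public sample:

```sql
-- fullVisitorId identifies the user; visitId is unique only per user,
-- so the concatenated pair identifies a session.
SELECT
  COUNT(DISTINCT CONCAT(fullVisitorId, '-', CAST(visitId AS STRING))) AS sessions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
```

If this number still differs from the GA UI, one common cause is sessions without interaction events; the UI counts sessions where totals.visits = 1.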
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions have hits.page.pagePath like /vests% or /helmets%? How many were vests only vs. helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
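To turn those per-session flags into day-level counts, the query above can be wrapped once more; a legacy-SQL sketch along the same lines:

```sql
SELECT
  SUM(IF(has_helmets AND has_vests, 1, 0)) AS both_helmets_and_vests,
  SUM(IF(has_helmets AND NOT has_vests, 1, 0)) AS helmets_only,
  SUM(IF(NOT has_helmets AND has_vests, 1, 0)) AS vests_only,
  SUM(IF(NOT has_helmets AND NOT has_vests, 1, 0)) AS neither_helmets_nor_vests
FROM (
  SELECT
    visitId,
    SOME(hits.page.pagePath LIKE '/helmets%') WITHIN RECORD AS has_helmets,
    SOME(hits.page.pagePath LIKE '/vests%') WITHIN RECORD AS has_vests,
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
```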
Way 1: easier, but you need to repeat it for each field
Obviously you can do something like this :
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then run multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc.).
Way 2: works fine, but not with repeated fields
If you want ONE query that just groups by the first part of the string, you can do something like this:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I do this, it returns the 63 sessions, with a total count of 63 :).
Way 3: using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeated, you need a FLATTEN in your query:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason you need FLATTEN here is that the "hits" field is repeated. If you don't flatten, the query won't look at ALL the "hits" in your response. Adding FLATTEN makes you work off a sub-table where each hit is in its own row, so you can query all of them.
If you want it by session instead of by hit (it'll be both), do something like:
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a
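For reference, in #standardSQL the same flattening is written with UNNEST instead of FLATTEN; a sketch (assuming the element at offset 1 corresponds to legacy FIRST(SPLIT(...)), since standard SQL's SPLIT keeps the leading empty string):

```sql
SELECT b, a, COUNT(*)
FROM (
  SELECT
    visitId AS b,
    SPLIT(hit.page.pagePath, '/')[SAFE_OFFSET(1)] AS a
  FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`,
    UNNEST(hits) AS hit
)
GROUP BY b, a
```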