I'm using BigQuery and am trying to query custom dimensions alongside non-custom dimensions. The analytics data is sent from an app, and I want a table with these columns: UserID (custom dimension), platformID (custom dimension), ScreenName (the app equivalent of "Page name"), and date. The metric is "number of screenviews", grouped by all of these dimensions. This is what it looks like below:
The photo of the GA report:
So, in BigQuery, I could get numbers that matched the GA report above until I added custom dimensions. Once I added them, the numbers no longer made any sense.
I know that custom dimensions are nested in BigQuery, so I used FLATTEN at first. Then I tried without FLATTEN and got the same results: the numbers make no sense (they are hundreds of times larger than in the GA interface).
My queries are below (one without FLATTEN and one with FLATTEN).
P.S. Ideally I wanted to use count(hits) instead of count(hits.appInfo.screenName), but I kept getting an error when I selected hits in my subquery.
My query without FLATTEN is below. Could you help me figure out why all the data gets messed up once I add custom dimensions?
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
UserIdd,
platform,
count(hits.appInfo.screenName)
FROM (
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
max(case when hits.customdimensions.index = 20 then hits.customdimensions.value end) within record as platform
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))
)
where UserIdd is not null
and platform = 'Android'
GROUP BY
1,
2,
3,
4,
5
ORDER BY
6 DESC
And here is my query with FLATTEN (same issue: the numbers don't make sense):
SELECT
date,
hits.appInfo.version,
customDimensions.index,
customDimensions.value,
hits.appInfo.screenName,
UserIdd,
count(hits.appInfo.screenName)
FROM (FLATTEN(( FLATTEN((
SELECT
date,
hits.appInfo.version,
customDimensions.value,
customDimensions.index,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))), customDimensions.value)),hits.type))
WHERE
customDimensions.value = 'Android'
and customDimensions.index = 20
and UserIdd is not null
GROUP BY
1,
2,
3,
4,
5,
6
ORDER BY
7 DESC
I'm not positive that hits.customDimensions.* will always contain the user-scoped dimensions (and I'm guessing your userId dimension is user-scoped).
Specifically, user-scoped dimensions should be queried from customDimensions, not hits.customDimensions.
Notionally, the first step is to make customDimensions compatible with hits.* via flattening or scoped aggregation. I'll explain the flattening approach.
GA records have the shape (customDimensions[], hits[], ...), which is no good for querying both fields. We begin by flattening these to (customDimensionN, hits[], ...).
One level up, by selecting fields under hits.*, we implicitly flatten the table into (customDimensionN, hitN) records. We filter these to include only the records matching (customDimension5, appviewN).
The last step is to count everything up.
SELECT date, v, sn, uid, COUNT(*)
FROM (
SELECT
date,
hits.appInfo.version v,
hits.appInfo.screenName sn,
customDimensions.value uid
FROM
FLATTEN((
SELECT customDimensions.*, hits.*, date
FROM
TABLE_DATE_RANGE(
[fiery-cabinet-97820:87025718.ga_sessions_],
TIMESTAMP('2017-04-04'),
TIMESTAMP('2017-04-04'))),
customDimensions)
WHERE hits.type = "APPVIEW" and customDimensions.index = 5)
GROUP BY 1,2,3,4
ORDER BY 5 DESC
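To make the shapes above concrete, here is a rough Python model of what the flattening does. It is only a sketch: the session record, its values, and the helper name flatten_pairs are made up for illustration, and plain dicts stand in for the nested GA export row.

```python
from itertools import product

# Hypothetical session record shaped like a GA export row:
# two repeated fields (customDimensions[], hits[]) side by side.
session = {
    "date": "20170404",
    "customDimensions": [
        {"index": 5, "value": "user-123"},
        {"index": 20, "value": "Android"},
    ],
    "hits": [
        {"type": "APPVIEW", "screenName": "Home"},
        {"type": "APPVIEW", "screenName": "Settings"},
        {"type": "EVENT", "screenName": "Home"},
    ],
}


def flatten_pairs(record):
    """Yield one row per (customDimension, hit) pair, like the flattened table."""
    for cd, hit in product(record["customDimensions"], record["hits"]):
        yield {"date": record["date"], "cd": cd, "hit": hit}


all_rows = list(flatten_pairs(session))

# Keep only screen views paired with custom dimension 5 (the user ID).
screenviews = [
    row for row in all_rows
    if row["hit"]["type"] == "APPVIEW" and row["cd"]["index"] == 5
]

print(len(all_rows))     # 6 = 2 custom dimensions x 3 hits
print(len(screenviews))  # 2 = the actual number of screen views
```

Counting all_rows without the index filter is the kind of thing that inflates the numbers: every hit is repeated once per custom dimension.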
Here's another equivalent approach. This uses the scoped aggregation trick that I've seen recommended in the GA BQ cookbook. Looking at the query explanation, however, the MAX(IF(...)) WITHIN RECORD seems to be quite expensive, triggering an extra COMPUTE and AGGREGATE phase in the first stage. Bonus points for being a bit more digestible, though.
SELECT sn, uid, date, v, COUNT(*)
FROM (
SELECT
MAX(IF(customDimensions.index = 5, customDimensions.value, null)) within record as uid,
hits.appInfo.screenname as sn,
date,
hits.appInfo.version as v,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04')))
WHERE hits.type = "APPVIEW" and uid is not null
GROUP BY 1,2,3,4
ORDER BY 5 DESC
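For comparison, the scoped aggregation can be pictured like this in Python. Again just a sketch with made-up data; max_if_within_record is a hypothetical helper that imitates MAX(IF(...)) WITHIN RECORD by reducing the repeated field to one value per session, so the hits are never multiplied.

```python
from collections import Counter

# Hypothetical session record (illustrative values only).
session = {
    "date": "20170404",
    "customDimensions": [
        {"index": 5, "value": "user-123"},
        {"index": 20, "value": "Android"},
    ],
    "hits": [
        {"type": "APPVIEW", "screenName": "Home"},
        {"type": "APPVIEW", "screenName": "Settings"},
        {"type": "APPVIEW", "screenName": "Home"},
        {"type": "EVENT", "screenName": "Home"},
    ],
}


def max_if_within_record(record, index):
    """Collapse the repeated customDimensions field to a single value per record."""
    values = [cd["value"] for cd in record["customDimensions"] if cd["index"] == index]
    return max(values) if values else None


uid = max_if_within_record(session, 5)

# One row per hit, each carrying the record-level uid; then count per (sn, uid).
counts = Counter(
    (hit["screenName"], uid)
    for hit in session["hits"]
    if hit["type"] == "APPVIEW" and uid is not None
)

print(counts[("Home", "user-123")])      # 2
print(counts[("Settings", "user-123")])  # 1
```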
I'm not yet familiar with the Standard SQL dialect of BQ, but it seems that it would simplify this kind of wrangling. You might want to wrap your head around that if you'll be making many queries like this.
Related
Want to use multiple conditions on event_params.value.string with different event_param.key:
I have synced my Firebase data to BigQuery and am trying to visualize it in Data Studio. The data looks like this:
The event_params.value.string field holds values such as "app", "4G", "App_Open", and "DashboardOnionActivity", and
the Event Param Name field has values like: ...., Action, Label
I want to count only those App_Open events whose Label is DashboardOnionActivity.
I was using a CASE WHEN ... THEN construct:
CASE
WHEN REGEXP_MATCH(event_params.value.string, "(?i) App_Open") THEN "1-App_Open"
WHEN REGEXP_MATCH(event_params.value.string, "(?i)selfie_capture") THEN "2-selfie_capture"
ELSE "0"
END
This gives me the counts of App_Open and selfie_capture, but I am not sure how to apply two conditions, because the Param Name differs: one value is stored under Action and the other under Label.
One workaround would be to log separate events such as DashboardOnionActivity_App_Open, but I am looking for a more efficient approach if one exists.
I solved it with the custom query option when adding the data through the BigQuery (bq) connector in Data Studio.
The data exported from Firebase to bq is nested,
so UNNEST() has to be used in subqueries to get the values out.
I used a custom query in this format:
SELECT
event_date,event_name,event_params,
CASE
WHEN (SELECT count(*) FROM UNNEST(event_params) i where i.value.string_value='App_Open') = 1 AND (SELECT count(*) FROM UNNEST(event_params) i where i.value.string_value='DashboardOnionActivity' ) =1 THEN "1-App_Open"
ELSE "other"
END App_status
From <table_name>
where event_params is the repeated struct (record) field.
Let me break down the query and explain :
SELECT count(*) FROM UNNEST(event_params) i where i.value.string_value='App_Open' :
This counts how many times 'App_Open' appears in event_params.value.string_value for one event. Since event_params contains no duplicates, the count is either 0 or 1.
The same applies to DashboardOnionActivity: the count is 1 when it occurs and 0 otherwise.
Once this is added as a data source in Data Studio, the count can be visualized with charts such as bar, inverted bar, scorecard, etc.
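If it helps, the logic of the two counting subqueries can be mimicked in Python. This is only an illustration: the events below, their keys, and the function name app_status are invented, and lists of dicts stand in for the nested event_params field.

```python
def count_param(event_params, wanted):
    """Imitate SELECT count(*) FROM UNNEST(event_params) WHERE string_value = wanted."""
    return sum(1 for p in event_params if p["value"].get("string_value") == wanted)


def app_status(event_params):
    """Label an event '1-App_Open' only when both values occur exactly once."""
    has_open = count_param(event_params, "App_Open") == 1
    has_dashboard = count_param(event_params, "DashboardOnionActivity") == 1
    return "1-App_Open" if has_open and has_dashboard else "other"


# Hypothetical events: the condition spans two different parameter keys.
dashboard_open = [
    {"key": "Action", "value": {"string_value": "App_Open"}},
    {"key": "Label", "value": {"string_value": "DashboardOnionActivity"}},
]
other_open = [
    {"key": "Action", "value": {"string_value": "App_Open"}},
    {"key": "Label", "value": {"string_value": "SomeOtherActivity"}},
]

print(app_status(dashboard_open))  # 1-App_Open
print(app_status(other_open))      # other
```

Because each event keeps its own list of parameters, the two conditions are checked against the same event, which is what a plain CASE on a single flattened value cannot do.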
please, do not flag as a duplicate question; I checked similar questions but could not find a solution
I am querying GA data in BigQuery and in particular I need to see pageviews by user ID, which is a custom dimension and so needs unnesting. The numbers do not match, though. They do match when I look at pageviews without the custom dimension, so something must be wrong with my query.
Any help will be much appreciated! Thank you.
SELECT
date AS Date,
MAX(CASE
WHEN cd.index=2 THEN cd.value
ELSE NULL
END) AS `Institution_ID`,
MAX(CASE
WHEN cd.index=3 THEN cd.value
ELSE NULL
END) AS `Institution_Name`
FROM `ga_sessions_*`, UNNEST(customDimensions) AS cd
GROUP BY
date
The comma is a lateral CROSS JOIN when applied to an array. This has consequences: if the array is NULL or empty, the cross join produces no rows for that session, so the left side is not preserved. Otherwise, your table gets expanded into one row for every entry in the customDimensions array.
You should always write subqueries on customDimensions arrays because semantically it doesn't make sense to expand the table with custom dimensions.
SELECT
date AS Date,
(SELECT value FROM UNNEST(customDimensions) WHERE index=2) AS `Institution_ID`,
(SELECT value FROM UNNEST(customDimensions) WHERE index=3) AS `Institution_Name`
FROM
`ga_sessions_*` AS t
GROUP BY
-- group by every selected column; grouping by date alone leaves the subqueries ungrouped
1, 2, 3
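A small Python model may show why the comma join throws the numbers off. The sessions and the helper cd_value are made up for this sketch. One difference to keep in mind: cd_value quietly takes the first match, whereas a scalar subquery in BigQuery raises an error if it returns more than one row.

```python
# Hypothetical sessions; the second one has no custom dimensions at all.
sessions = [
    {
        "date": "20170101",
        "pageviews": 4,
        "customDimensions": [
            {"index": 2, "value": "inst-42"},
            {"index": 3, "value": "Example University"},
            {"index": 7, "value": "something-else"},
        ],
    },
    {"date": "20170101", "pageviews": 3, "customDimensions": []},
]

# Comma join with UNNEST: one row per (session, custom dimension) pair.
joined = [(s, cd) for s in sessions for cd in s["customDimensions"]]
inflated_pageviews = sum(s["pageviews"] for s, _ in joined)


def cd_value(session, index):
    """Look up one custom dimension without expanding the session row."""
    return next(
        (cd["value"] for cd in session["customDimensions"] if cd["index"] == index),
        None,
    )


# Subquery style: still exactly one row per session.
rows = [
    {
        "date": s["date"],
        "institution_id": cd_value(s, 2),
        "institution_name": cd_value(s, 3),
        "pageviews": s["pageviews"],
    }
    for s in sessions
]
true_pageviews = sum(r["pageviews"] for r in rows)

print(len(joined), inflated_pageviews)  # 3 12
print(len(rows), true_pageviews)        # 2 7
```

The first session is counted three times in the joined version, and the second session vanishes entirely, so the total is wrong in both directions.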
I'm trying to pull a report from bigquery where I can see pageviews segmented by day and couple of custom dimensions (one at hit level and the other at session level) with this query:
SELECT
date
,SUM(totals.pageviews) as PVs
,MAX(IF(hits.customDimensions.index = 11, hits.customDimensions.value,NULL)) AS x
,MAX(IF(customDimensions.index = 1, customDimensions.value,NULL)) AS y
FROM TABLE_DATE_RANGE([111111111.ga_sessions_]
,TIMESTAMP('2016-10-01')
,TIMESTAMP('2016-10-31'))
GROUP EACH BY 1
I get the following:
Error: Cannot query the cross product of repeated fields customDimensions.index and hits.page.pagePath.
I've been looking at other answers but didn't find anything addressing a similar enough issue. Could you suggest a better query?
Thanks!
You need to flatten your data.
Take a look at Google's example, which reports "Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived", in Dealing with data:
"To query across more than one repeated field, you need to flatten one of the fields:
SELECT
fullName,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearsLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place"
To get around the TABLE_DATE_RANGE limitation, try creating a subselect first:
SELECT
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
customDimensions.value
FROM
FLATTEN((
SELECT
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
customDimensions.index,
customDimensions.value
FROM (TABLE_DATE_RANGE([dataset.table_], DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY'), DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY')))),hits.eventInfo.eventCategory)
This was discussed on the official Google BigQuery issue and feature request tracker.
I'm totally new to BigQuery, so please excuse any obvious mistakes. I'm trying to build a query that counts the number of distinct elements in one custom dimension, grouped by another custom dimension.
I tried this, but it doesn't work:
SELECT
MAX(IF(hits.customDimensions.index=7,hits.customDimensions.value,NULL)) AS Author,
COUNT(MAX(IF(hits.customDimensions.index=10,hits.customDimensions.value,NULL))) AS Articles
FROM (
SELECT
*
FROM
TABLE_DATE_RANGE([blablabla-blabla-115411:104672022.ga_sessions_test], TIMESTAMP('20160927'), TIMESTAMP('20161024'))) AS t0
GROUP BY
MAX(IF(hits.customDimensions.index=7,hits.customDimensions.value,NULL)) AS Author,
Using standard SQL (uncheck "Use Legacy SQL" under "Show Options"), does this query work? For each entry in hits, it selects the value for an index of 7 as the author, and then counts the number of entries where index is 10 as the number of articles. It makes the assumption that there is at most one entry with an index of 7 in customDimensions.
SELECT
(SELECT value FROM UNNEST(hits.customDimensions)
WHERE index = 7) AS Author,
SUM((SELECT COUNT(*) FROM UNNEST(hits.customDimensions)
WHERE index = 10)) AS Articles
FROM
`your-dataset.ga_sessions_test*` AS t, UNNEST(t.hits) AS hits
-- Assumes date-sharded tables (as TABLE_DATE_RANGE in the question implies), filtered by suffix.
WHERE _TABLE_SUFFIX BETWEEN '20160927' AND '20161024'
GROUP BY Author;
I am new to BigQuery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles, so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of session IDs where the path is like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
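In case the SOME(...) WITHIN RECORD part is unfamiliar: it behaves roughly like Python's any() over the hits of a single session. A sketch with invented sessions and paths (not the real London Helmet data):

```python
from collections import Counter

# Hypothetical sessions, each with the page paths of its hits.
sessions = [
    {"visitId": 1, "paths": ["/helmets/road", "/vests/orange"]},
    {"visitId": 2, "paths": ["/helmets/kids"]},
    {"visitId": 3, "paths": ["/vests/yellow", "/basket"]},
    {"visitId": 4, "paths": ["/home"]},
]


def classify(session):
    """Mirror the four boolean columns as a single label per session."""
    has_helmets = any(p.startswith("/helmets") for p in session["paths"])
    has_vests = any(p.startswith("/vests") for p in session["paths"])
    if has_helmets and has_vests:
        return "both_helmets_and_vests"
    if has_helmets:
        return "helmets_only"
    if has_vests:
        return "vests_only"
    return "neither_helmets_nor_vests"


summary = Counter(classify(s) for s in sessions)
print(summary)
```

Each session lands in exactly one bucket, so the four counts add up to the total number of sessions.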
Way 1, easier but you need to repeat on each field
Obviously you can do something like this :
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then have multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc).
Way 2, works fine, but not with repeated fields
If you want ONE query that'll just group by on the first part of the string, you can do something like that :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I do this, it returns the 63 sessions, with a total count of 63 :).
Way 3, using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeated, you would need a FLATTEN in your query :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason why you need to FLATTEN here is that the "hits" field is repeated. If you don't flatten, the query won't look at ALL the "hits" in each session. Adding FLATTEN makes you work off a sub-table where each hit is in its own row, so you can query all of them.
If you want it by session as well as by path (it'll be grouped on both), do something like :
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a
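To see the difference between counting hits and counting sessions after flattening, here is a toy Python version. The sessions are invented, and section() is a hypothetical helper that takes the first non-empty path segment; it is not meant to reproduce exactly what legacy SPLIT returns for a leading slash.

```python
from collections import defaultdict

# Hypothetical sessions keyed by visit ID, each with its hit paths.
sessions = {
    101: ["/helmets/road", "/helmets/kids", "/vests/orange"],
    102: ["/helmets/road"],
    103: ["/vests/yellow", "/vests/orange"],
}


def section(path):
    """Return the first non-empty path segment, e.g. '/helmets/road' -> 'helmets'."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else ""


hit_counts = defaultdict(int)    # one count per flattened hit row
session_ids = defaultdict(set)   # distinct sessions per section

for visit_id, paths in sessions.items():
    for path in paths:
        key = section(path)
        hit_counts[key] += 1
        session_ids[key].add(visit_id)

session_counts = {key: len(ids) for key, ids in session_ids.items()}

print(dict(hit_counts))  # {'helmets': 3, 'vests': 3}
print(session_counts)    # {'helmets': 2, 'vests': 2}
```

As the question anticipates, the per-section session counts add up to 4 even though there are only 3 sessions, because session 101 touches both sections.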