I am trying to query a column called hits.customDimensions.index, so I had to unnest several times to reach the data. I want to filter so that only the rows where hits.customDimensions.index = 16 show up, using the query below, but it returns all of the customDimensions rows for any observation that has customDimensions.index = 16 somewhere.
I'm not sure what I am doing wrong. As you can see in the image I added, every custom dimension from the matching observations still appears; I only want that one, flattened.
SELECT * EXCEPT(hit, hits)
FROM ***,
UNNEST(hit) h
CROSS JOIN UNNEST(customDimensions) cd
WHERE cd.index = 16 AND timeOnSite IS NOT NULL
Try the query below:
SELECT *
FROM ******.ga_sessions_export,
UNNEST(hits) h
CROSS JOIN UNNEST(h.customDimensions) cd
WHERE cd.index = 16
LIMIT 10
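The problem in your original query is that you filtered not on hits.customDimensions but on the separate, session-level field that is also named customDimensions: UNNEST(customDimensions) reads the top-level array, whereas UNNEST(h.customDimensions) reads the array nested inside each hit.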
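To make the difference between the two arrays concrete, here is a small Python sketch that mimics both unnest choices on a made-up session. The record contents are hypothetical; only the field names hits, customDimensions, index and value come from the question.

```python
# Hypothetical GA-style session: session-level customDimensions plus
# hit-level customDimensions nested inside each hit.
session = {
    "customDimensions": [{"index": 1, "value": "session-scope"}],
    "hits": [
        {"hitNumber": 1, "customDimensions": [{"index": 16, "value": "a"},
                                              {"index": 17, "value": "b"}]},
        {"hitNumber": 2, "customDimensions": [{"index": 3, "value": "c"}]},
    ],
}

def unnest_hit_dimensions(s):
    """Mimics: UNNEST(hits) h CROSS JOIN UNNEST(h.customDimensions) cd."""
    return [(h["hitNumber"], cd["index"], cd["value"])
            for h in s["hits"]
            for cd in h["customDimensions"]]

def unnest_session_dimensions(s):
    """Mimics: UNNEST(hits) h CROSS JOIN UNNEST(customDimensions) cd."""
    return [(h["hitNumber"], cd["index"], cd["value"])
            for h in s["hits"]
            for cd in s["customDimensions"]]

# Equivalent of WHERE cd.index = 16 applied to each variant.
hit_rows = [r for r in unnest_hit_dimensions(session) if r[1] == 16]
session_rows = [r for r in unnest_session_dimensions(session) if r[1] == 16]
print(hit_rows)
print(session_rows)
```

In this toy data the hit-level variant keeps only the single index-16 row, while the session-level variant finds nothing, because index 16 exists only inside the hits.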
Custom dimensions are stored as an array. If you only want to see your specific custom dimension, you can exclude hits and hit columns from the select statement.
SELECT * EXCEPT(hits, hit),
FROM ******.ga_sessions_export,
UNNEST(hits) hit,
UNNEST(customDimensions) cd
WHERE cd.index = 16
LIMIT 10
I'm using BigQuery and am trying to query custom dimensions along with non-custom dimensions. The analytics data is sent from an app, and I want a table with these columns: UserID (custom dimension), platformID (custom dimension), ScreenName (the app equivalent of "Page name"), and date. The metric is "number of screenviews", grouped by all of these dimensions. This is what it looks like below:
The photo of the GA report:
So, in BigQuery, I could get numbers that matched the GA report above until I added custom dimensions. Once I added them, the numbers no longer made any sense.
I know that custom dimensions are nested in BigQuery, so I made sure to use FLATTEN at first. Then I tried without FLATTEN and got the same results. The numbers make no sense (they are hundreds of times larger than in the GA interface).
My queries are below (one without FLATTEN and one with FLATTEN).
P.S. Ideally I wanted to use count(hits) instead of count(hits.appInfo.screenName), but I kept getting an error when I selected hits in my subquery.
My query without FLATTEN is below. Could you help me figure out why all the data gets messed up once I add custom dimensions?
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
UserIdd,
platform,
count(hits.appInfo.screenName)
FROM (
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
max(case when hits.customdimensions.index = 20 then hits.customdimensions.value end) within record as platform
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))
)
where UserIdd is not null
and platform = 'Android'
GROUP BY
1,
2,
3,
4,
5
ORDER BY
6 DESC
And here is my query with FLATTEN (same issue: the numbers don't make sense):
SELECT
date,
hits.appInfo.version,
customDimensions.index,
customDimensions.value,
hits.appInfo.screenName,
UserIdd,
count(hits.appInfo.screenName)
FROM (FLATTEN(( FLATTEN((
SELECT
date,
hits.appInfo.version,
customDimensions.value,
customDimensions.index,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))), customDimensions.value)),hits.type))
WHERE
customDimensions.value = 'Android'
and customDimensions.index = 20
and UserIdd is not null
GROUP BY
1,
2,
3,
4,
5,
6
ORDER BY
7 DESC
I'm not positive that hits.customDimensions.* will always have the user-scoped dimensions (and I'm guessing your userId dimension is user-scoped).
Specifically, user-scoped dimensions should be queried from customDimensions, not hits.customDimensions.
Notionally, the first step is to make customDimensions compatible with hits.* via flattening or scoped aggregation. I'll explain the flattening approach.
GA records have the shape (customDimensions[], hits[], ...), which is no good for querying both fields. We begin by flattening these to (customDimensionN, hits[], ...).
One level up, by selecting fields under hits.*, we implicitly flatten the table into (customDimensionN, hitN) records. We filter these to include only the records matching (customDimension5, appviewN).
The last step is to count everything up.
SELECT date, v, sn, uid, COUNT(*)
FROM (
SELECT
date,
hits.appInfo.version v,
hits.appInfo.screenName sn,
customDimensions.value uid
FROM
FLATTEN((
SELECT customDimensions.*, hits.*, date
FROM
TABLE_DATE_RANGE(
[fiery-cabinet-97820:87025718.ga_sessions_],
TIMESTAMP('2017-04-04'),
TIMESTAMP('2017-04-04'))),
customDimensions)
WHERE hits.type = "APPVIEW" and customDimensions.index = 5)
GROUP BY 1,2,3,4
ORDER BY 5 DESC
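As a rough illustration of the flattening idea, the Python sketch below builds the (customDimensionN, hitN) pairs by hand and then filters and counts them. The sample record is invented for the example, and the code only imitates the shape of the query; it is not how BigQuery executes it.

```python
from collections import Counter

# Hypothetical record shaped like (customDimensions[], hits[], date).
records = [
    {"date": "20170404",
     "customDimensions": [{"index": 5, "value": "user-1"},
                          {"index": 20, "value": "Android"}],
     "hits": [{"type": "APPVIEW", "version": "1.0", "screenName": "Home"},
              {"type": "APPVIEW", "version": "1.0", "screenName": "Home"},
              {"type": "EVENT", "version": "1.0", "screenName": "Home"}]},
]

def flatten(recs):
    """Yield one (date, customDimension, hit) row per combination."""
    for rec in recs:
        for cd in rec["customDimensions"]:
            for hit in rec["hits"]:
                yield rec["date"], cd, hit

# Without any filter, every hit is repeated once per custom dimension.
unfiltered = sum(1 for _ in flatten(records))

# Keep only (customDimension5, appview) pairs, then count per group.
counts = Counter(
    (date, hit["version"], hit["screenName"], cd["value"])
    for date, cd, hit in flatten(records)
    if hit["type"] == "APPVIEW" and cd["index"] == 5
)
print(unfiltered, dict(counts))
```

The unfiltered row count shows why totals inflate: each hit appears once per custom dimension until the filter pins the dimension to a single index.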
Here's another equivalent approach. This uses the scoped aggregation trick that I've seen recommended in the GA BQ cookbook. Looking at the query explanation, however, the MAX(IF(...)) WITHIN RECORD seems to be quite expensive, triggering an extra COMPUTE and AGGREGATE phase in the first stage. Bonus points for being a bit more digestible, though.
SELECT sn, uid, date, v, COUNT(*)
FROM (
SELECT
MAX(IF(customDimensions.index = 5, customDimensions.value, null)) within record as uid,
hits.appInfo.screenname as sn,
date,
hits.appInfo.version as v,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04')))
WHERE hits.type = "APPVIEW" and uid is not null
GROUP BY 1,2,3,4
ORDER BY 5 DESC
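The scoped-aggregation variant can be pictured as computing one value per record first and only then walking the hits. The Python sketch below imitates that; the helper name scoped_value and the sample records are hypothetical.

```python
from collections import Counter

def scoped_value(custom_dimensions, index):
    """Imitate MAX(IF(index = N, value, NULL)) WITHIN RECORD: one value per record, or None."""
    values = [cd["value"] for cd in custom_dimensions if cd["index"] == index]
    return max(values) if values else None

# Hypothetical records; the second one has no custom dimension 5.
records = [
    {"date": "20170404",
     "customDimensions": [{"index": 5, "value": "user-1"}],
     "hits": [{"type": "APPVIEW", "version": "1.0", "screenName": "Home"},
              {"type": "APPVIEW", "version": "1.0", "screenName": "Settings"}]},
    {"date": "20170404",
     "customDimensions": [{"index": 20, "value": "Android"}],
     "hits": [{"type": "APPVIEW", "version": "1.0", "screenName": "Home"}]},
]

counts = Counter()
for rec in records:
    uid = scoped_value(rec["customDimensions"], 5)
    if uid is None:  # equivalent of "uid is not null"
        continue
    for hit in rec["hits"]:
        if hit["type"] == "APPVIEW":
            counts[(hit["screenName"], uid, rec["date"], hit["version"])] += 1
print(dict(counts))
```

Because the user id is resolved once per record, no hit is duplicated, which is why this form tends to read more easily than the flattened one.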
I'm not yet familiar with the Standard SQL dialect of BQ, but it seems that it would simplify this kind of wrangling. You might want to wrap your head around that if you'll be making many queries like this.
I'm totally new to BigQuery, so please excuse any obvious mistakes. I'm trying to build a query that counts the number of distinct elements in one custom dimension, grouped by another custom dimension.
I tried this, but it doesn't work:
SELECT
MAX(IF(hits.customDimensions.index=7,hits.customDimensions.value,NULL)) AS Author,
COUNT(MAX(IF(hits.customDimensions.index=10,hits.customDimensions.value,NULL))) AS Articles
FROM (
SELECT
*
FROM
TABLE_DATE_RANGE([blablabla-blabla-115411:104672022.ga_sessions_test], TIMESTAMP('20160927'), TIMESTAMP('20161024'))) AS t0
GROUP BY
MAX(IF(hits.customDimensions.index=7,hits.customDimensions.value,NULL)) AS Author,
Using standard SQL (uncheck "Use Legacy SQL" under "Show Options"), does this query work? For each entry in hits, it selects the value for an index of 7 as the author, and then counts the number of entries where index is 10 as the number of articles. It makes the assumption that there is at most one entry with an index of 7 in customDimensions.
SELECT
(SELECT value FROM UNNEST(hits.customDimensions)
WHERE index = 7) AS Author,
SUM((SELECT COUNT(*) FROM UNNEST(hits.customDimensions)
WHERE index = 10)) AS Articles
FROM
`your-dataset.ga_sessions_test` AS t, UNNEST(t.hits) AS hits
WHERE _PARTITIONTIME BETWEEN '2016-09-27' AND '2016-10-24'
GROUP BY Author;
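To check what the query above computes, here is a small Python imitation on invented hits. It mirrors the two subqueries: one value at index 7 per hit (raising an error if there are several, much as a scalar subquery would), and a count of index-10 entries. The distinct_articles tally is an extra shown only for comparison with the question's wording and is not part of the query.

```python
from collections import defaultdict

def scalar_value(custom_dimensions, index):
    """Return the single value at the given index, or None if absent."""
    matches = [cd["value"] for cd in custom_dimensions if cd["index"] == index]
    if len(matches) > 1:
        raise ValueError("scalar subquery produced more than one element")
    return matches[0] if matches else None

# Hypothetical hits: index 7 holds the author, index 10 the article.
hits = [
    {"customDimensions": [{"index": 7, "value": "alice"}, {"index": 10, "value": "post-1"}]},
    {"customDimensions": [{"index": 7, "value": "alice"}, {"index": 10, "value": "post-1"}]},
    {"customDimensions": [{"index": 7, "value": "bob"}, {"index": 10, "value": "post-2"}]},
]

articles = defaultdict(int)           # what SUM(COUNT(*)) adds up per author
distinct_articles = defaultdict(set)  # distinct values, for comparison only
for hit in hits:
    author = scalar_value(hit["customDimensions"], 7)
    dims = [cd for cd in hit["customDimensions"] if cd["index"] == 10]
    articles[author] += len(dims)
    distinct_articles[author].update(cd["value"] for cd in dims)
print(dict(articles))
print({a: len(v) for a, v in distinct_articles.items()})
```

On this data the entry count and the distinct count differ for the first author, so it is worth deciding which of the two the report actually needs.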
I have a table Customer which has the following columns,
user_name,current_id,id,params,display,store.
I am writing a query like this,
SELECT * FROM Customer WHERE user_name='Mike' AND current_id='9845' AND id='Get_Owner' AND params='owner=1' order by(display) limit 6 offset 0
Sometimes I want to fetch a particular value that is not among the first six, together with five other values selected the same way as above. How can I do that?
For example, I want something like this:
SELECT * FROM Customer WHERE user_name='Mike' AND current_id='9845' and id='Get_Owner' AND params='owner=1' AND stored='Shelly.Am'
I want Shelly.Am and five other values, as in my first query.
You can combine two queries by using a compound query.
The ORDER BY/LIMIT clauses would apply to the entire compound query, so the second query must be moved into a subquery:
SELECT *
FROM Customer
WHERE user_name='Mike'
AND current_id='9845'
AND id='Get_Owner'
AND params='owner=1'
AND stored='Shelly.Am'
UNION ALL
SELECT *
FROM (SELECT *
FROM Customer
WHERE user_name='Mike'
AND current_id='9845'
AND id='Get_Owner'
AND params='owner=1'
AND stored!='Shelly.Am'
ORDER BY display
LIMIT 5);
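The compound query can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the question does not name the database engine, the sample rows are invented, and the column is called stored as in the queries (the column list in the question says store).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer (user_name, current_id, id, params, display, stored)"
)
# Hypothetical rows: eight matches, with Shelly.Am sorting last by display.
names = ["A", "B", "C", "D", "E", "F", "G", "Shelly.Am"]
conn.executemany(
    "INSERT INTO Customer VALUES ('Mike', '9845', 'Get_Owner', 'owner=1', ?, ?)",
    list(enumerate(names, start=1)),
)

query = """
SELECT stored FROM Customer
WHERE user_name='Mike' AND current_id='9845' AND id='Get_Owner'
  AND params='owner=1' AND stored='Shelly.Am'
UNION ALL
SELECT stored
FROM (SELECT stored FROM Customer
      WHERE user_name='Mike' AND current_id='9845' AND id='Get_Owner'
        AND params='owner=1' AND stored!='Shelly.Am'
      ORDER BY display
      LIMIT 5)
"""
result = [row[0] for row in conn.execute(query)]
print(result)
```

The wanted row comes from the first branch and the five lowest display values from the subquery, so the result always has at most six rows.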
I am new to BigQuery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand that one session can contain multiple paths/titles, so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of session ids where the path is like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
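The per-session flags can be pictured with Python's any(), which plays the role of SOME(...) WITHIN RECORD here. The sessions below are invented; the sketch also totals the categories, which the query above leaves as a follow-up step.

```python
from collections import Counter

# Hypothetical sessions: visitId -> page paths of its hits.
sessions = {
    101: ["/helmets/road", "/vests/hi-vis"],
    102: ["/helmets/kids"],
    103: ["/vests/orange", "/"],
    104: ["/about"],
}

def classify(paths):
    """Return the category of one session based on all of its hits."""
    has_helmets = any(p.startswith("/helmets") for p in paths)
    has_vests = any(p.startswith("/vests") for p in paths)
    if has_helmets and has_vests:
        return "both_helmets_and_vests"
    if has_helmets:
        return "helmets_only"
    if has_vests:
        return "vests_only"
    return "neither_helmets_nor_vests"

totals = Counter(classify(paths) for paths in sessions.values())
print(dict(totals))
```

Each session lands in exactly one category, so the four totals add up to the number of sessions.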
Way 1, easier but you need to repeat on each field
You can do something like this:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then run a separate query for each of your own substrings (one with '/vests%', one with '/helmets%', etc.).
Way 2, works fine, but not with repeated fields
If you want ONE query that groups by the first part of the path, you can do something like this:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I run this, it returns the 63 sessions, with a total count of 63 :).
Way 3, using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeated, you need a FLATTEN in your query:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason you need FLATTEN here is that the "hits" field is repeated. If you don't flatten, the query won't look into ALL the "hits" in each record. Adding FLATTEN makes the query work on a sub-table where each hit is in its own row, so you can query all of them.
If you want the counts broken down by session as well (the result is grouped by both session and path), do something like:
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a
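The effect of flattening can be sketched in Python: each hit becomes its own row, and the rows are then counted per path segment, per (session, segment) pair, or per session. The sessions are invented, and first_segment is a hypothetical helper that skips the empty string before the leading slash.

```python
from collections import Counter

# Hypothetical sessions with a repeated "hits" field (page paths only).
sessions = [
    {"visitId": 1, "hits": ["/helmets/road", "/helmets/kids", "/vests/hi-vis"]},
    {"visitId": 2, "hits": ["/vests/orange"]},
]

def first_segment(path):
    """Return the first non-empty part of a path, e.g. 'helmets'."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else ""

def flatten_hits(recs):
    """Yield one (visitId, pagePath) row per hit, like FLATTEN(table, hits)."""
    for rec in recs:
        for path in rec["hits"]:
            yield rec["visitId"], path

# Hits per segment (group by a).
hits_per_segment = Counter(first_segment(p) for _, p in flatten_hits(sessions))
# Hits per session and segment (group by b, a).
hits_per_session_segment = Counter(
    (visit, first_segment(p)) for visit, p in flatten_hits(sessions)
)
# Sessions per segment: count each (session, segment) pair once.
sessions_per_segment = Counter(seg for _, seg in hits_per_session_segment)
print(dict(hits_per_segment), dict(sessions_per_segment))
```

Counting the distinct (session, segment) pairs is what turns hit counts into session counts, which is the number the question asks for.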
I have three tables (forward, reverse and probe) that all contain the same columns (Query, QueryLength, Hit, Start, End, Strand, Description, Length, Percent_id, Score, Evalue).
I want to get the unique rows from 'forward' where the 'Hit' is not found in either the 'reverse' table or the 'probe' table. With 'AND' I don't get any results, with 'OR' I get the comparison only with reverse.
CREATE TABLE f AS SELECT * FROM forward WHERE forward.Hit NOT IN (SELECT Hit from reverse) OR (SELECT Hit FROM probe)
Thanks for your help.
When you write ... OR (SELECT Hit FROM probe), you have a scalar subquery, i.e., the database looks only at the first value returned by the subquery, and treats it as 'true' if the value is non-zero.
To check whether a value is found in another table, you have to use IN (in both cases):
SELECT * FROM forward
WHERE Hit NOT IN (SELECT Hit FROM reverse)
AND Hit NOT IN (SELECT Hit FROM probe)
If you want to check entire records instead of only one column, you can use EXCEPT:
SELECT * FROM forward
EXCEPT
SELECT * FROM reverse
EXCEPT
SELECT * FROM probe
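Both forms can be compared on toy data with Python's built-in sqlite3 module. This is a sketch under assumptions: the tables are cut down to two columns (Hit plus a Score column) and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("forward", "reverse", "probe"):
    conn.execute(f"CREATE TABLE {name} (Hit TEXT, Score INTEGER)")
conn.executemany("INSERT INTO forward VALUES (?, ?)",
                 [("h1", 10), ("h2", 20), ("h3", 30), ("h4", 40)])
conn.executemany("INSERT INTO reverse VALUES (?, ?)", [("h1", 99)])
conn.executemany("INSERT INTO probe VALUES (?, ?)", [("h3", 30)])

# Column-level check: drop rows whose Hit appears in either other table.
not_in_rows = sorted(conn.execute("""
    SELECT * FROM forward
    WHERE Hit NOT IN (SELECT Hit FROM reverse)
      AND Hit NOT IN (SELECT Hit FROM probe)
""").fetchall())

# Row-level check: drop only rows that match in every column.
except_rows = sorted(conn.execute("""
    SELECT * FROM forward
    EXCEPT
    SELECT * FROM reverse
    EXCEPT
    SELECT * FROM probe
""").fetchall())

print(not_in_rows)
print(except_rows)
```

Here h1 survives the EXCEPT form because its Score differs in reverse, but not the NOT IN form, which compares Hit alone. Note also that NOT IN returns no rows at all if the subquery yields a NULL, so it may be worth filtering NULL Hit values out first.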