Count on one custom dimension and group by another with BigQuery - google-analytics

I'm totally new to BigQuery, please excuse me for any obvious mistake. I'm trying to build a query where I can count the number of distinct element from one custom dimension and group this by another custom dimension.
I tried this but It doesn't work :
SELECT
MAX(IF(hits.customDimensions.index=7,hits.customDimensions.value,NULL)) AS Author,
COUNT(MAX(IF(hits.customDimensions.index=10,hits.customDimensions.value,NULL))) AS Articles
FROM (
SELECT
*
FROM
TABLE_DATE_RANGE([blablabla-blabla-115411:104672022.ga_sessions_test], TIMESTAMP('20160927'), TIMESTAMP('20161024'))) AS t0
GROUP BY
MAX(IF(hits.customDimensions.index=7,hits.customDimensions.value,NULL)) AS Author,

Using standard SQL (uncheck "Use Legacy SQL" under "Show Options"), does this query work? For each entry in hits, it selects the value for an index of 7 as the author, and then counts the number of entries where index is 10 as the number of articles. It makes the assumption that there is at most one entry with an index of 7 in customDimensions.
SELECT
(SELECT value FROM UNNEST(hits.customDimensions)
WHERE index = 7) AS Author,
SUM((SELECT COUNT(*) FROM UNNEST(hits.customDimensions)
WHERE index = 10)) AS Articles
FROM
`your-dataset.ga_sessions_test` AS t, UNNEST(t.hits) AS hits
WHERE _PARTITIONTIME BETWEEN '2016-09-27' AND '2016-10-24'
GROUP BY Author;

Related

Custom dimensions in google BigQuery

I'm using big query and am trying to import custom dimensions along with noncustom dimensions. The analytics is sent from an app and basically I want a table with columns: UserID (custom dimension), platformID (custom dimension), ScreenName (basically app version of "Page name"), and date. The metric is "number of screenviews" grouped onto all of these dimensions. This is what it looks like below:
The photo of the GA report:
So, in bigquery, I could get numbers that checked out (when compared to GA report above) until I added in custom dimensions. Once I added custom dimensions, the numbers no longer made any sense.
I know that custom dimensions are nested within big query. So I made sure to use FLATTEN at first. Then I tried without flatten and got same results. The numbers make no sense (are hundreds of times larger than in GA interface).
My queries are below (one without FLATTEN and one with FLATTEN).
ps I ideally wanted to use
count(hits)
instead of
count(hits.appInfo.screenName)
But I kept getting an error when I selected hits in my subquery.
My query without flatten is below. If you could help me figure out why is it that once I add custom dimensions all data gets messed up
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
UserIdd,
platform,
count(hits.appInfo.screenName)
FROM (
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
max(case when hits.customdimensions.index = 20 then hits.customdimensions.value end) within record as platform
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))
)
where UserIdd is not null
and platform = 'Android'
GROUP BY
1,
2,
3,
4,
5
ORDER BY
6 DESC
and here is my query with FLATTEN (same issue - numbers dont make sense)
SELECT
date,
hits.appInfo.version,
customDimensions.index,
customDimensions.value,
hits.appInfo.screenName,
UserIdd,
count(hits.appInfo.screenName)
FROM (FLATTEN(( FLATTEN((
SELECT
date,
hits.appInfo.version,
customDimensions.value,
customDimensions.index,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))), customDimensions.value)),hits.type))
WHERE
customDimensions.value = 'Android'
and customDimensions.index = 20
and UserIdd is not null
GROUP BY
1,
2,
3,
4,
5,
6
ORDER BY
7 DESC
I'm not positive that hits.customDimensions.* will always have the user-scoped dimensions (and I'm guessing your userId metric is user-scoped).
Specifically, user-scoped dimensions should be queried from customDimensions, not hits.customDimensions.
Notionally, the first step is to make customDimensions compatible with hits.* via flattening or scoped aggregation. I'll explain the flattening approach.
GA records have the shape (customDimensions[], hits[], ...), which is no good for querying both fields. We begin by flattening these to (customDimensionN, hits[], ...).
One level up, by selecting fields under hits.*, we implicitly flatten the table into (customDimensionN, hitN) records. We filter these to include only the records matching (customDimension5, appviewN).
The last step is to count everything up.
SELECT date, v, sn, uid, COUNT(*)
FROM (
SELECT
date,
hits.appInfo.version v,
hits.appInfo.screenName sn,
customDimensions.value uid
FROM
FLATTEN((
SELECT customDimensions.*, hits.*, date
FROM
TABLE_DATE_RANGE(
[fiery-cabinet-97820:87025718.ga_sessions_],
TIMESTAMP('2017-04-04'),
TIMESTAMP('2017-04-04'))),
customDimensions)
WHERE hits.type = "APPVIEW" and customDimensions.index = 5)
GROUP BY 1,2,3,4
ORDER BY 5 DESC
Here's another equivalent approach. This uses the scoped aggregation trick that I've seen recommended in the GA BQ cookbook. Looking at the query explanation, however, the MAX(IF(...)) WITHIN RECORD seems to be quite expensive, triggering an extra COMPUTE and AGGREGATE phase in the first stage. Bonus points for being a bit more digestible, though.
SELECT sn, uid, date, v, COUNT(*)
FROM (
SELECT
MAX(IF(customDimensions.index = 5, customDimensions.value, null)) within record as uid,
hits.appInfo.screenname as sn,
date,
hits.appInfo.version as v,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04')))
WHERE hits.type = "APPVIEW" and uid is not null
GROUP BY 1,2,3,4
ORDER BY 5 DESC
I'm not yet familiar with the Standard SQL dialect of BQ, but it seems that it would simplify this kind of wrangling. You might want to wrap your head around that if you'll be making many queries like this.

getting page views by two custom dimensions of different granularity in bigquery

I'm trying to pull a report from bigquery where I can see pageviews segmented by day and couple of custom dimensions (one at hit level and the other at session level) with this query:
SELECT
date
,SUM(totals.pageviews) as PVs
,MAX(IF(hits.customDimensions.index = 11, hits.customDimensions.value,NULL)) AS x
,MAX(IF(customDimensions.index = 1, customDimensions.value,NULL)) AS y
FROM TABLE_DATE_RANGE([111111111.ga_sessions_]
,TIMESTAMP('2016-10-01')
,TIMESTAMP('2016-10-31'))
GROUP EACH BY 1
I get the following:
Error: Cannot query the cross product of repeated fields customDimensions.index and hits.page.pagePath.
I've been looking at other answers but didn't find anything addressing a similar enough issue. Could you suggest a better query?
Thanks!
you need to flatten your data
take a look at Google's example reporting "Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived" within Dealing with data
"To query across more than one repeated field, you need to flatten one of the fields:
SELECT
fullName,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearsLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place"
to get around the table_date_range limitation, try creating a sub select first
SELECT
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
customDimensions.value
FROM
FLATTEN((
SELECT
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
customDimensions.index,
customDimensions.value
FROM (TABLE_DATE_RANGE([dataset.table_], DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY'), DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY')))),hits.eventInfo.eventCategory)
as discussed on Official Google BigQuery issue and feature request tracker

SQLite: SELECT from grouped and ordered result

I'm new to SQL(ite), so i'm sorry if there is a simple answer i just were to stupid to find the right search terms for.
I got 2 tables: 1 for user information and another holding points a user achieved. It's a simple one to many relation (a user can achieve points multiple times).
table1 contains "userID" and "Username" ...
table2 contains "userID" and "Amount" ...
Now i wanted to get a highscore rank for a given username.
To get the highscore i did:
SELECT Username, SUM(Amount) AS total FROM table2 JOIN table1 USING (userID) GROUP BY Username ORDER BY total DESC
How could i select a single Username and get its position from the grouped and ordered result? I have no idea how a subselect would've to look like for my goal. Is it even possible in a single query?
You cannot calculate the position of the user without referencing the other data. SQLite does not have a ranking function which would be ideal for your user case, nor does it have a row number feature that would serve as an acceptable substitute.
I suppose the closest you could get would be to drop this data into a temp table that has an incrementing ID, but I think you'd get very messy there.
It's best to handle this within the application. Get all the users and calculate rank. Cache individual user results as necessary.
Without knowing anything more about the operating context of the app/DB it's hard to provide a more specific recommendation.
For a specific user, this query gets the total amount:
SELECT SUM(Amount)
FROM Table2
WHERE userID = ?
You have to count how many other users have a higher amount than that single user:
SELECT COUNT(*)
FROM table1
WHERE (SELECT SUM(Amount)
FROM Table2
WHERE userID = table1.userID)
>=
(SELECT SUM(Amount)
FROM Table2
WHERE userID = ?);

Sessions by hits.page.pagePath in GA bigquery tables

I am new to bigquery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of sessionids where path like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
Way 1, easier but you need to repeat on each field
Obviously you can do something like this :
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then have multiple queries for your own substrings (one with '/vests%', one with 'helmets%', etc).
Way 2, works fine, but not with repeated fields
If you want ONE query that'll just group by on the first part of the string, you can do something like that :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I do this, it returns me the following the 63 sessions, with a total count of 63 :).
Way 3, using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeatable, you would need a FLATTEN in your query :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason why you need to FLATTEN here is that the "hits" field is repeatable. If you don't flatten, it won't look into ALL the "hits" in your response. Adding "FLATTEN" will make you work off a sub-table where each hit is in its own row, so you can query on all of them.
If you want it by sessions instead of hits, (it'll be both), do something like :
Select b, a Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b, FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a

How to create database table dynamically and insert data selected by query

I'm working on website where I need to find rank of user on the basis of score. Earlier I'm calculating the score and rank of user by sql query .
select * from (
select
usrid,
ROW_NUMBER()
OVER(ORDER BY (count(*)+sum(sup)+sum(opp)+sum(visited)*0.3) DESC) AS rank,
(count(*)+sum(sup)+sum(opp)+sum(visited)*0.3 ) As score
from [DB_].[dbo].[dsas]
group by usrid) as cash
where usrid=#userid
Please don't concentrate more on query because this is only to explain how I select data.
Problem: Now I can't use above query because every time I use rank it need to select rank from dsas table and data of dsas table is increasing day by day and slows down my website.
What I need is select data by above query and insert in another table named as score. Can we do anything like this?
A better solution is to either include score as a field in your user table or have a separate table for scores. Any time you add new sup, opp, or visited data for a user, also recalculate their score at that time.
Then to get the highest ranking users, you will be able to perform a very simple select statement, ordering by score descending, and only fetching the number of rows you want. It will be very fast.

Resources