GA pageviews don't match BigQuery when unnesting for custom dimension? - google-analytics

Please do not flag this as a duplicate question; I checked similar questions but could not find a solution.
I am querying GA data in BigQuery and in particular I need to see pageviews by user ID, which is a custom dimension and so needs unnesting. The numbers do not match, though. They do match when I look at pageviews without the custom dimension, so something must be wrong with my query.
Any help will be much appreciated! Thank you.
SELECT
date AS Date,
MAX(CASE
WHEN cd.index=2 THEN cd.value
ELSE NULL
END) AS `Institution_ID`,
MAX(CASE
WHEN cd.index=3 THEN cd.value
ELSE NULL
END) AS `Institution_Name`
FROM `ga_sessions_*`, UNNEST(customDimensions) AS cd
GROUP BY
date

The comma is a lateral CROSS JOIN when applied to an array. This comes with some consequences: if the array is empty or NULL, the cross join drops the row entirely - the left side is not preserved. And your table gets expanded by one row for every entry in the customDimensions array, which is what inflates your counts.
You should always write subqueries on customDimensions arrays because semantically it doesn't make sense to expand the table with custom dimensions.
SELECT
date AS Date,
(SELECT value FROM UNNEST(customDimensions) WHERE index=2) AS `Institution_ID`,
(SELECT value FROM UNNEST(customDimensions) WHERE index=3) AS `Institution_Name`
FROM
`ga_sessions_*` AS t
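If you also need the pageview counts that prompted the question, the hits array can be aggregated with a subquery in the same way. A minimal sketch, assuming the same wildcard table and dimension index as above, and counting hits of type 'PAGE' as pageviews:
SELECT
  Date,
  Institution_ID,
  SUM(pageviews) AS Pageviews
FROM (
  SELECT
    date AS Date,
    (SELECT value FROM UNNEST(customDimensions) WHERE index=2) AS Institution_ID,
    -- hits of type 'PAGE' are the pageview hits in the GA export schema
    (SELECT COUNT(*) FROM UNNEST(hits) WHERE type='PAGE') AS pageviews
  FROM `ga_sessions_*`
)
GROUP BY Date, Institution_ID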

Related

Custom dimensions in google BigQuery

I'm using BigQuery and am trying to import custom dimensions along with non-custom dimensions. The analytics data is sent from an app, and basically I want a table with columns: UserID (custom dimension), platformID (custom dimension), ScreenName (basically the app version of "Page name"), and date. The metric is "number of screenviews", grouped by all of these dimensions. This is what it looks like below:
(Photo of the GA report not included.)
So, in BigQuery, I could get numbers that checked out (when compared to the GA report above) until I added in custom dimensions. Once I added custom dimensions, the numbers no longer made any sense.
I know that custom dimensions are nested within BigQuery, so I made sure to use FLATTEN at first. Then I tried without FLATTEN and got the same results. The numbers make no sense (they are hundreds of times larger than in the GA interface).
My queries are below (one without FLATTEN and one with FLATTEN).
PS: I ideally wanted to use count(hits) instead of count(hits.appInfo.screenName), but I kept getting an error when I selected hits in my subquery.
My query without FLATTEN is below. Could you help me figure out why all the data gets messed up once I add custom dimensions?
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
UserIdd,
platform,
count(hits.appInfo.screenName)
FROM (
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
max(case when hits.customdimensions.index = 20 then hits.customdimensions.value end) within record as platform
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))
)
where UserIdd is not null
and platform = 'Android'
GROUP BY
1,
2,
3,
4,
5
ORDER BY
6 DESC
And here is my query with FLATTEN (same issue - the numbers don't make sense):
SELECT
date,
hits.appInfo.version,
customDimensions.index,
customDimensions.value,
hits.appInfo.screenName,
UserIdd,
count(hits.appInfo.screenName)
FROM (FLATTEN(( FLATTEN((
SELECT
date,
hits.appInfo.version,
customDimensions.value,
customDimensions.index,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))), customDimensions.value)),hits.type))
WHERE
customDimensions.value = 'Android'
and customDimensions.index = 20
and UserIdd is not null
GROUP BY
1,
2,
3,
4,
5,
6
ORDER BY
7 DESC
I'm not positive that hits.customDimensions.* will always have the user-scoped dimensions (and I'm guessing your userId dimension is user-scoped).
Specifically, user-scoped dimensions should be queried from customDimensions, not hits.customDimensions.
Notionally, the first step is to make customDimensions compatible with hits.* via flattening or scoped aggregation. I'll explain the flattening approach.
GA records have the shape (customDimensions[], hits[], ...), which is no good for querying both fields. We begin by flattening these to (customDimensionN, hits[], ...).
One level up, by selecting fields under hits.*, we implicitly flatten the table into (customDimensionN, hitN) records. We filter these to include only the records matching (customDimension5, appviewN).
The last step is to count everything up.
SELECT date, v, sn, uid, COUNT(*)
FROM (
SELECT
date,
hits.appInfo.version v,
hits.appInfo.screenName sn,
customDimensions.value uid
FROM
FLATTEN((
SELECT customDimensions.*, hits.*, date
FROM
TABLE_DATE_RANGE(
[fiery-cabinet-97820:87025718.ga_sessions_],
TIMESTAMP('2017-04-04'),
TIMESTAMP('2017-04-04'))),
customDimensions)
WHERE hits.type = "APPVIEW" and customDimensions.index = 5)
GROUP BY 1,2,3,4
ORDER BY 5 DESC
Here's another equivalent approach. This uses the scoped aggregation trick that I've seen recommended in the GA BQ cookbook. Looking at the query explanation, however, the MAX(IF(...)) WITHIN RECORD seems to be quite expensive, triggering an extra COMPUTE and AGGREGATE phase in the first stage. Bonus points for being a bit more digestible, though.
SELECT sn, uid, date, v, COUNT(*)
FROM (
SELECT
MAX(IF(customDimensions.index = 5, customDimensions.value, null)) within record as uid,
hits.appInfo.screenname as sn,
date,
hits.appInfo.version as v,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04')))
WHERE hits.type = "APPVIEW" and uid is not null
GROUP BY 1,2,3,4
ORDER BY 5 DESC
I'm not yet familiar with the Standard SQL dialect of BQ, but it seems that it would simplify this kind of wrangling. You might want to wrap your head around that if you'll be making many queries like this.
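For what it's worth, here is a rough Standard SQL equivalent of the queries above - a sketch only, carrying over the dataset path, the APPVIEW hit type, and dimension index 5 from the question:
SELECT date, v, sn, uid, COUNT(*) AS screenviews
FROM (
  SELECT
    t.date,
    h.appInfo.version AS v,
    h.appInfo.screenName AS sn,
    -- user-scoped dimension pulled from the session-level customDimensions array
    (SELECT value FROM UNNEST(t.customDimensions) WHERE index = 5) AS uid
  FROM `fiery-cabinet-97820.87025718.ga_sessions_*` AS t, UNNEST(t.hits) AS h
  WHERE _TABLE_SUFFIX = '20170404' AND h.type = 'APPVIEW'
)
WHERE uid IS NOT NULL
GROUP BY 1, 2, 3, 4
ORDER BY 5 DESC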

How to stop a row from showing on Null count in query Access 2010

Here is the problem. I have a table and the table is 100% completed. There are no null values in the table. The table is broken down to:
Division > Region > RAPM > Status > Disposition
I want to count how many times "Training" is in [Disposition] for a [Region] using a query.
The problem I get is that when a [Region] has 0 "Training" in [Disposition], the count comes back as Null, so the entire row is not shown.
How do I get the count to come back as "0" so I can keep the [Division], [Region], and [RAPM] in the results for reporting, even if there is a 0 count for training?
I have tried NZ() but this will not work because there is technically no Null cell to be converted.
Here is the statement:
SELECT tblAlignment.Division, tblAlignment.Region, tblAlignment.RAPM, Count(tblCase.Disposition) AS CountofTraining
FROM tblCase INNER JOIN tblAlignment ON (tblCase.Region = tblAlignment.Region) AND (tblCase.Store = tblAlignment.[Store Number])
WHERE (((tblCase.Status)="Closed") AND ((tblCase.Disposition)="Training"))
GROUP BY tblAlignment.Division, tblAlignment.Region, tblAlignment.RAPM
HAVING (((tblAlignment.Division)=[Forms]![frmDashboardNative]![NavigationSubform].[Form]![NavigationSubform].[Form]![Combo16]))
This is not the complete answer, but my reputation is not high enough to comment so this is the only way I can respond. To get a fuller answer you would need to provide more detailed table structure including primary keys and relationships between the tables. Making guesses from what you have provided I can make a couple of suggestions, but your post raises a few questions:
You say you want to count related entries based on the same Region, but your join links on Region and Store. Is there a relationship between Store and Region? Can a single store only ever appear in one Region? If so, then I think this information should possibly be in a separate table.
I think the HAVING condition should actually be in the WHERE clause.
I think you may have duplicated the ![NavigationSubform].[Form]
Anyway, a slightly more generic example of one way of achieving what you're after would be:
SELECT a.Region, Nz(b.RecordCount, 0) AS FinalCount
FROM TableA a
LEFT JOIN (SELECT Region, Count(*) AS RecordCount
FROM TableB
WHERE Status = "Closed" AND Disposition = "Training"
GROUP BY Region) AS b ON a.Region = b.Region
WHERE a.Division = [Combo16]
Hope this is of some help.
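For example, adapted to the table names from the question (just a sketch following the same pattern; it assumes tblAlignment has one row per Division/Region/RAPM, otherwise group the outer query as well, and [Combo16] stands in for the full form reference):
SELECT a.Division, a.Region, a.RAPM, Nz(b.TrainingCount, 0) AS CountOfTraining
FROM tblAlignment AS a
LEFT JOIN (SELECT Region, Count(*) AS TrainingCount
           FROM tblCase
           WHERE Status = "Closed" AND Disposition = "Training"
           GROUP BY Region) AS b ON a.Region = b.Region
WHERE a.Division = [Combo16]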

Sessions by hits.page.pagePath in GA bigquery tables

I am new to BigQuery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles, so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of session ids where the path is like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, only vests, both helmets and vests, or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
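If you then want the number of sessions in each bucket rather than a row per session, you can wrap that query in a count - a sketch, still in legacy SQL:
SELECT
  -- each session falls into exactly one bucket, so the four counts sum to the total sessions
  SUM(IF(has_helmets AND has_vests, 1, 0)) AS both_helmets_and_vests,
  SUM(IF(has_helmets AND NOT has_vests, 1, 0)) AS helmets_only,
  SUM(IF(NOT has_helmets AND has_vests, 1, 0)) AS vests_only,
  SUM(IF(NOT has_helmets AND NOT has_vests, 1, 0)) AS neither_helmets_nor_vests
FROM (
  SELECT
    visitId,
    SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
    SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)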
Way 1: easier, but you need to repeat it on each field
Obviously you can do something like this:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then have multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc.).
Way 2: works fine, but not with repeated fields
If you want ONE query that'll just group by the first part of the string, you can do something like this:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I do this, it returns the 63 sessions, with a total count of 63 :).
Way 3: using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeated, you need a FLATTEN in your query:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason why you need FLATTEN here is that the "hits" field is repeated. If you don't flatten, it won't look into ALL the "hits" in your response. Adding FLATTEN makes you work off a sub-table where each hit is in its own row, so you can query all of them.
If you want it by session instead of by hit (it'll actually be both), do something like this:
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a

SQL sorting , paging, filtering best practices in ASP.NET

I am wondering how Google does it. I have a lot of slow queries when it comes to page count and total number of results. Google returns a count value of 250,000,000 in a fraction of a second.
I am dealing with grid views. I have built a custom pager for a GridView that requires an SQL query to return a page count based on the filters set by the user. There are at least 5 filters, including a keyword, a category and subcategory, a date range filter, and a sort expression filter for sorting. The query contains about 10 massive table left joins.
This query is executed every time a search is performed, and a query execution lasts an average of 30 seconds - be it a count or a select. I believe what's making it slow is my set of inclusive and exclusive date range filters. I have replaced (<=, >=) with BETWEEN ... AND but I still experience the same problem.
See the query here:
http://friendpaste.com/4G2uZexRfhd3sSVROqjZEc
I have problems with a long date range parameter.
Check my table that contains the dates:
http://friendpaste.com/1HrC0L62hFR4DghE6ypIRp
UPDATE [9/17/2010] I minimized my date query and removed the time.
I tried reducing the joins for my count query (I am actually having a problem with my filter count, which takes too long to return a result of 60k rows).
SELECT COUNT(DISTINCT esched.course_id)
FROM courses c
LEFT JOIN events_schedule esched
ON c.course_id = esched.course_id
LEFT JOIN course_categories cc
ON cc.course_id = c.course_id
LEFT JOIN categories cat
ON cat.category_id = cc.category_id
WHERE 1 = 1
AND c.course_type = 1
AND active = 1
AND c.country_id = 52
AND c.course_title LIKE '%cook%'
AND cat.main_category_id = 40
AND cat.category_id = 360
AND (
('2010-09-01' <= esched.date_start OR '2010-09-01' <= esched.date_end)
AND
('2010-09-25' >= esched.date_start OR '2010-09-25' >= esched.date_end)
)
I just noticed that my query is quite fast when I have a filter on my main or sub category fields. However, when I only have a date filter and the range is a month or a week, it needs to count a lot of rows and takes 30 seconds on average.
These are the static fields:
AND c.course_type = 1
AND active = 1
AND c.country_id = 52
UPDATE [9/17/2010]: If I create a hash of these three fields and store it in one field, will it change the speed?
These are my dynamic fields:
AND c.course_title LIKE '%cook%'
AND cat.main_category_id = 40
AND cat.category_id = 360
-- ?DateStart and ?DateEnd
UPDATE [9/17/2010]: Now my problem is the leading % in the LIKE query.
Will post an updated EXPLAIN.
Search engines like Google use very complex behind-the-scenes algorithms to index searches. Essentially, they have already determined which words occur on each page, as well as the relative importance of those words and the relative importance of the pages (relative to other pages). These indexes are very quick because they are based on bitwise indexing.
Consider the following google searches:
custom: 542 million Google hits
pager: 10.8 million
custom pager: 1.26 million
Essentially what they have done is created a record for the word custom and in that record they have placed a 1 for every page that contains it and a 0 for every page that doesn't contain it. Then they zip it up because there are a lot more 0s than 1s. They do the same for pager.
When the search custom pager comes in, they unzip both records, perform a bitwise AND on them and this results in an array of bits where length is the total number of pages that they have indexed and the number of 1s represents the hit count for the search. The position of each bit corresponds to a particular result which is known in advance and they only have to look up the full details of the first 10 to display on the first page.
This is oversimplified, but that is the general principle.
Oh yes, they also have huge banks of servers performing the indexing and huge banks of servers responding to search requests. HUGE banks of servers!
This makes them a lot quicker than anything that could be done in a relational database.
Now, to your question: Could you paste some sample SQL for us to look at?
One thing you could try is changing the order in which the tables and joins appear in your SQL statement. I know it seems that it shouldn't make a difference, but it certainly can. If you put the most restrictive joins earlier in the statement then you could well end up with fewer overall joins performed within the database.
A real world example. Say you wanted to find all of the entries in the phonebook under the name 'Johnson', with the number beginning with '7'. One way would be to look for all the numbers beginning with 7 and then join that with the numbers belonging to people called 'Johnson'. In fact it would be far quicker to perform the filtering the other way around even if you had indexing on both names and numbers. This is because the name 'Johnson' is more restrictive than the number 7.
So order does count, and database software is not always good at determining in advance which joins to perform first. I'm not sure about MySQL, as my experience is mostly with SQL Server, which uses index statistics to calculate which order to perform joins in. These stats get out of date after a number of inserts, updates and deletes, so they have to be re-computed periodically. If MySQL has something similar, you could try this.
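For reference, MySQL's equivalent is ANALYZE TABLE, which refreshes the key-distribution statistics the optimizer uses when it picks a join order. For example, with the tables from the count query above:
-- refresh optimizer statistics for the tables involved in the count query
ANALYZE TABLE courses, events_schedule, course_categories, categories;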
UPDATE
I have looked at the query that you posted. Ten left joins is not unusual and should perform fine as long as you have the right indexes in place. Yours is not a complicated query.
What you need to do is break this query down to its fundamentals. Comment out the lookup joins such as those to currency, course_stats, countries, states and cities along with the corresponding fields in the select statement. Does it still run as slowly? Probably not. But it is probably still not ideal.
So comment out all of the rest until you just have the courses and the group by course id and order by course id. Then, experiment with adding in the left joins to see which one has the greatest impact. Then, focusing on the ones with the greatest impact on performance, change the order of the joins. This is the trial-and-error approach. It would be a lot better for you to take a look at the indexes on the columns that you are joining on.
For example, the line cm.method_id = c.method_id would require a primary key on course_methodologies.method_id and a foreign key index on courses.method_id and so on. Also, all of the fields in the where, group by and order by clauses need indexes.
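As a rough illustration only (the index names are made up and the best column order depends on your data), the joins and filters in the count query above would want indexes along these lines:
-- join columns, per the example above
ALTER TABLE course_methodologies ADD PRIMARY KEY (method_id);
CREATE INDEX idx_courses_method_id ON courses (method_id);
CREATE INDEX idx_esched_course_dates ON events_schedule (course_id, date_start, date_end);
CREATE INDEX idx_course_categories ON course_categories (course_id, category_id);
-- filter columns from the WHERE clause (assumes active lives on courses)
CREATE INDEX idx_courses_filters ON courses (course_type, active, country_id);
CREATE INDEX idx_categories_main ON categories (main_category_id, category_id);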
Good luck
UPDATE 2
You seriously need to look at the date filtering on this query. What are you trying to do?
AND ((('2010-09-01 00:00:00' <= esched.date_start
AND esched.date_start <= '2010-09-25 00:00:00')
OR ('2010-09-01 00:00:00' <= esched.date_end
AND esched.date_end <= '2010-09-25 00:00:00'))
OR ((esched.date_start <= '2010-09-01 00:00:00'
AND '2010-09-01 00:00:00' <= esched.date_end)
OR (esched.date_start <= '2010-09-25 00:00:00'
AND '2010-09-25 00:00:00' <= esched.date_end)))
Can be re-written as:
AND (
-- date_start is between range - fine
(esched.date_start BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00')
-- date_end is between range - fine
OR (esched.date_end BETWEEN '2010-09-01 00:00:00' AND '2010-09-25 00:00:00')
OR (esched.date_start <= '2010-09-01 00:00:00' AND esched.date_end >= '2010-09-01 00:00:00')
OR (esched.date_start <= '2010-09-25 00:00:00' AND esched.date_end >= '2010-09-25 00:00:00')
)
On your update, you mention you suspect the problem to be in the date filters.
All those date checks can be summed up in a single check:
esched.date_end >= '2010-09-01 00:00:00' and esched.date_start <= '2010-09-25 00:00:00'
If it behaves the same with the above, check whether the following returns quickly / picks up your indexes:
SELECT COUNT(DISTINCT esched.course_id)
FROM events_schedule esched
WHERE esched.date_end >= '2010-09-01 00:00:00' and esched.date_start <= '2010-09-25 00:00:00'
PS: I think that when using the join, you can do SELECT COUNT(c.course_id) to count the main course records in the query directly, i.e. you might not need the DISTINCT that way.
Re your update - now most of the time goes to the wildcard search after the change:
Use a MySQL full text search. Make sure to check the full-text restrictions; one important one is that it's only supported on MyISAM tables. I must say that I haven't really used the MySQL full text search, and I'm not sure how it impacts the use of other indexes in the query.
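A minimal sketch of what that could look like for the course title search (assuming the courses table is, or can be switched to, MyISAM):
-- full-text index on the title column (MyISAM only in this MySQL generation)
ALTER TABLE courses ADD FULLTEXT INDEX ft_course_title (course_title);

-- replaces: c.course_title LIKE '%cook%'
SELECT course_id
FROM courses
WHERE MATCH(course_title) AGAINST('cook' IN BOOLEAN MODE);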
If you can't use a full text search, IMHO you are out of luck with your current approach, since a regular index can't be used to check whether a word is contained in any part of the text.
If that's the case, you might want to switch that specific part of the approach and introduce a tag/keyword-based approach. Unlike categories, you can assign multiple tags to each item, so it's flexible yet doesn't have the free-text issue.
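A sketch of that tag/keyword approach (all table and column names here are hypothetical):
CREATE TABLE tags (
  tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(64) NOT NULL,
  UNIQUE KEY uq_tag_name (name)
);

CREATE TABLE course_tags (
  course_id INT UNSIGNED NOT NULL,
  tag_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (course_id, tag_id),
  KEY idx_course_tags_tag (tag_id)
);

-- courses tagged 'cook' are found with index lookups instead of LIKE '%cook%'
SELECT c.course_id
FROM courses c
JOIN course_tags ct ON ct.course_id = c.course_id
JOIN tags t ON t.tag_id = ct.tag_id
WHERE t.name = 'cook';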

Does a multi-column index work for single column selects too?

I've got (for example) an index:
CREATE INDEX someIndex ON orders (customer, date);
Does this index only accelerate queries where both customer and date are used, or does it accelerate queries for a single column like this too?
SELECT * FROM orders WHERE customer > 33;
I'm using SQLite.
If the answer is yes, why is it possible to create more than one index per table?
Yet another question: How much faster is a combined index compared with two separate indexes when you use both columns in a query?
marc_s has the correct answer to your first question. The first key in a multi-key index can work just like a single-key index, but any subsequent keys will not.
How much faster the composite index is depends on your data and how you structure your index and query, but it is usually significant. The indexes essentially allow SQLite to do a binary search on the fields.
Using the example you gave, if you ran the query:
SELECT * FROM orders WHERE customer > 33 AND date > 99
SQLite would first get all results using a binary search on the entire table where customer > 33. Then it would do a binary search on only those results looking for date > 99.
If you did the same query with two separate indexes on customer and date, SQLite would have to binary search the whole table twice, first for the customer and again for the date.
So how much of a speed increase you will see depends on how you structure your index with regard to your query. Ideally, the first field in your index and your query should be the one that eliminates the most possible matches as that will give the greatest speed increase by greatly reducing the amount of work the second search has to do.
For more information see this:
http://www.sqlite.org/optoverview.html
I'm pretty sure this will work, yes - it does in MS SQL Server anyway.
However, this index doesn't help you if you need to select on just the date, e.g. a date range. In that case, you might need to create a second index on just the date to make those queries more efficient.
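Something along these lines (the index name is just illustrative):
CREATE INDEX dateIndex ON orders (date);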
Marc
I commonly use combined indexes to sort through data I wish to paginate or request "streamily".
Assuming a customer can make more than one order, customers 0 through 11 exist, and there are several orders per customer all inserted in random order, I want to sort a query based on customer number followed by the date. You should also sort on the id field last, to split sets where a customer has several identical dates (even if that may never happen).
sqlite> CREATE INDEX customer_asc_date_asc_index_asc ON orders
(customer ASC, date ASC, id ASC);
Get page 1 of a sorted query (limited to 10 items):
sqlite> SELECT id, customer, date FROM orders
ORDER BY customer ASC, date ASC, id ASC LIMIT 10;
2653|1|1303828585
2520|1|1303828713
2583|1|1303829785
1828|1|1303830446
1756|1|1303830540
1761|1|1303831506
2442|1|1303831705
2523|1|1303833761
2160|1|1303835195
2645|1|1303837524
Get the next page:
sqlite> SELECT id, customer, date FROM orders WHERE
(customer = 1 AND date = 1303837524 and id > 2645) OR
(customer = 1 AND date > 1303837524) OR
(customer > 1)
ORDER BY customer ASC, date ASC, id ASC LIMIT 10;
2515|1|1303837914
2370|1|1303839573
1898|1|1303840317
1546|1|1303842312
1889|1|1303843243
2439|1|1303843699
2167|1|1303849376
1544|1|1303850494
2247|1|1303850869
2108|1|1303853285
And so on...
Having the indexes in place reduces server-side index scanning compared with using a query OFFSET coupled with a LIMIT: the higher the offset goes, the longer the query takes and the harder the drives seek. Using this method eliminates that.
Using this method is advised if you plan on joining data later but only need a limited set of data per request. Join against a SUBSELECT as described above to reduce memory overhead for large tables.
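As a sketch of that last point - the customers table and its name column here are assumptions, not part of the original example:
sqlite> SELECT o.id, o.customer, o.date, c.name
FROM (SELECT id, customer, date FROM orders
      WHERE (customer = 1 AND date = 1303837524 AND id > 2645) OR
            (customer = 1 AND date > 1303837524) OR
            (customer > 1)
      ORDER BY customer ASC, date ASC, id ASC LIMIT 10) AS o
JOIN customers AS c ON c.id = o.customer;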
