getting page views by two custom dimensions of different granularity in bigquery - google-analytics

I'm trying to pull a report from BigQuery where I can see pageviews segmented by day and a couple of custom dimensions (one at hit level and the other at session level) with this query:
SELECT
date
,SUM(totals.pageviews) as PVs
,MAX(IF(hits.customDimensions.index = 11, hits.customDimensions.value,NULL)) AS x
,MAX(IF(customDimensions.index = 1, customDimensions.value,NULL)) AS y
FROM TABLE_DATE_RANGE([111111111.ga_sessions_]
,TIMESTAMP('2016-10-01')
,TIMESTAMP('2016-10-31'))
GROUP EACH BY 1
I get the following:
Error: Cannot query the cross product of repeated fields customDimensions.index and hits.page.pagePath.
I've been looking at other answers but didn't find anything addressing a similar enough issue. Could you suggest a better query?
Thanks!

You need to flatten your data.
Take a look at Google's example, which reports "Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived", in Dealing with data:
"To query across more than one repeated field, you need to flatten one of the fields:
SELECT
fullName,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearsLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place"
To get around the TABLE_DATE_RANGE limitation, try creating a subselect first:
SELECT
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
customDimensions.value
FROM
FLATTEN((
SELECT
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
customDimensions.index,
customDimensions.value
FROM (TABLE_DATE_RANGE([dataset.table_], DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY'), DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY')))),hits.eventInfo.eventCategory)
as discussed on Official Google BigQuery issue and feature request tracker
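To make the effect of FLATTEN concrete, here is a minimal Python sketch. The session record, its values, and the `flatten` helper are hypothetical stand-ins for a GA export row, not BigQuery APIs. It only illustrates that flattening one repeated field leaves a single repeated field per row, so the two can be read together.

```python
# Hypothetical GA-style session record with two independent repeated fields.
session = {
    "date": "20161001",
    "customDimensions": [{"index": 1, "value": "member"}],
    "hits": [
        {"pagePath": "/home", "customDimensions": [{"index": 11, "value": "blue"}]},
        {"pagePath": "/cart", "customDimensions": [{"index": 11, "value": "red"}]},
    ],
}


def flatten(record, field):
    """Emit one copy of the record per element of the repeated `field`,
    roughly what FLATTEN(table, field) does in legacy SQL."""
    rows = []
    for element in record[field]:
        row = dict(record)
        row[field] = element  # the field now holds a single element, not a list
        rows.append(row)
    return rows


rows = flatten(session, "hits")
for row in rows:
    # The session-level dimension is repeated on every hit row.
    session_dim = next(d["value"] for d in row["customDimensions"] if d["index"] == 1)
    hit_dim = next(d["value"] for d in row["hits"]["customDimensions"] if d["index"] == 11)
    print(row["date"], row["hits"]["pagePath"], hit_dim, session_dim)
```

Note that every session-level field, including totals such as pageviews, is copied onto each flattened row, so after flattening it is safer to count hit rows than to sum session totals.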

Related

Big Query and Google Analytics UI do not match when ecommerce action filter applied

We are validating a query in BigQuery and cannot get the results to match the Google Analytics UI. A similar question can be found here, but in our case the mismatch only occurs when we apply a specific filter on ecommerce_action.action_type.
Here is the query:
SELECT COUNT(distinct fullVisitorId+cast(visitid as string)) AS sessions
FROM (
SELECT
device.browserVersion,
geoNetwork.networkLocation,
geoNetwork.networkDomain,
geoNetwork.city,
geoNetwork.country,
geoNetwork.continent,
geoNetwork.region,
device.browserSize,
visitNumber,
trafficSource.source,
trafficSource.medium,
fullvisitorId,
visitId,
device.screenResolution,
device.flashVersion,
device.operatingSystem,
device.browser,
totals.pageviews,
channelGrouping,
totals.transactionRevenue,
totals.timeOnSite,
totals.newVisits,
totals.visits,
date,
hits.eCommerceAction.action_type
FROM
(select *
from TABLE_DATE_RANGE([zzzzzzzzz.ga_sessions_],
<range>) ))t
WHERE
hits.eCommerceAction.action_type = '2' and <stuff to remove bots>
)
From the UI using the built in shopping behavior report, we get 3.836M unique sessions with a product detail view, compared with 3.684M unique sessions in Big Query using the query above.
A few questions:
1) We are under the impression the shopping behavior report "Sessions with Product View" breakdown is based off of the ecommerce_action.actiontype filter. Is that true?
2) Is there a .totals pre-aggregated table that the UI may be pulling from?
It sounds like the issue is that COUNT(DISTINCT ...) is approximate when using legacy SQL, as noted in the migration guide, so the counts are not accurate. Either use standard SQL instead (preferred) or use EXACT_COUNT_DISTINCT with legacy SQL.
You're including product list views in your query.
As described in https://support.google.com/analytics/answer/3437719, you need to make sure that no product has isImpression = TRUE, because that would mean it is a product list view.
This query sums all sessions that contain a hit with action_type = '2' for which every product's isImpression is null or false:
SELECT
SUM(totals.visits) AS sessions
FROM
`project.123456789.ga_sessions_20180101` AS t
WHERE
(
SELECT
LOGICAL_OR(h.ecommerceaction.action_type='2')
FROM
t.hits AS h
WHERE
(SELECT LOGICAL_AND(isimpression IS NULL OR isimpression = FALSE) FROM h.product))
For legacySQL you can adapt the example in the documentation.
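The filtering logic of that query can be sketched in plain Python. The sample sessions and the `is_detail_view_session` helper are hypothetical, and the field names are simplified from the export schema. The sketch assumes that LOGICAL_AND over an empty product array yields NULL, so hits without products are dropped by the WHERE clause.

```python
def is_detail_view_session(session):
    """True if some hit has action_type '2' and none of its products is an impression."""
    for hit in session["hits"]:
        products = hit.get("product", [])
        # Assumption: an empty product array makes LOGICAL_AND return NULL in SQL,
        # which WHERE treats as false, so such hits are skipped here as well.
        no_impressions = bool(products) and all(not p.get("isImpression") for p in products)
        if no_impressions and hit.get("action_type") == "2":
            return True
    return False


# Hypothetical sample data.
sessions = [
    {"visits": 1, "hits": [{"action_type": "2", "product": [{"isImpression": None}]}]},
    {"visits": 1, "hits": [{"action_type": "2", "product": [{"isImpression": True}]}]},
    {"visits": 1, "hits": [{"action_type": "0", "product": []}]},
]

# Equivalent of SUM(totals.visits) over the sessions that pass the filter.
session_count = sum(s["visits"] or 0 for s in sessions if is_detail_view_session(s))
print(session_count)
```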
In addition to the fact that COUNT(DISTINCT ...) is approximate in legacy SQL, there could be sessions that contain only non-interactive hits. These are not counted as sessions in the Google Analytics UI, but both COUNT(DISTINCT ...) and EXACT_COUNT_DISTINCT(...) count them, because your query counts visit IDs.
Using SUM(totals.visits) you should get the same result as in the UI because SUM does not take into account NULL values of totals.visits (corresponding to sessions in which there are only non-interactive hits).
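A small Python sketch of the difference, using made-up sessions in which `visits` stands in for totals.visits and `None` stands in for NULL:

```python
# Hypothetical sessions; the second one contains only non-interactive hits,
# so its visits value is NULL (None here).
sessions = [
    {"fullVisitorId": "a", "visitId": 1, "visits": 1},
    {"fullVisitorId": "a", "visitId": 2, "visits": None},
    {"fullVisitorId": "b", "visitId": 1, "visits": 1},
]

# Counting distinct visitor/visit IDs includes the non-interactive session.
distinct_ids = len({(s["fullVisitorId"], s["visitId"]) for s in sessions})

# SUM ignores NULLs, so the non-interactive session is left out.
summed_visits = sum(s["visits"] for s in sessions if s["visits"] is not None)

print(distinct_ids, summed_visits)
```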

Custom dimensions in google BigQuery

I'm using big query and am trying to import custom dimensions along with noncustom dimensions. The analytics is sent from an app and basically I want a table with columns: UserID (custom dimension), platformID (custom dimension), ScreenName (basically app version of "Page name"), and date. The metric is "number of screenviews" grouped onto all of these dimensions. This is what it looks like below:
[Image: screenshot of the GA report]
So, in bigquery, I could get numbers that checked out (when compared to GA report above) until I added in custom dimensions. Once I added custom dimensions, the numbers no longer made any sense.
I know that custom dimensions are nested within big query. So I made sure to use FLATTEN at first. Then I tried without flatten and got same results. The numbers make no sense (are hundreds of times larger than in GA interface).
My queries are below (one without FLATTEN and one with FLATTEN).
ps I ideally wanted to use
count(hits)
instead of
count(hits.appInfo.screenName)
But I kept getting an error when I selected hits in my subquery.
My query without FLATTEN is below. Could you help me figure out why all the data gets messed up once I add custom dimensions?
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
UserIdd,
platform,
count(hits.appInfo.screenName)
FROM (
SELECT
date,
hits.appInfo.version,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
max(case when hits.customdimensions.index = 20 then hits.customdimensions.value end) within record as platform
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))
)
where UserIdd is not null
and platform = 'Android'
GROUP BY
1,
2,
3,
4,
5
ORDER BY
6 DESC
And here is my query with FLATTEN (same issue: the numbers don't make sense):
SELECT
date,
hits.appInfo.version,
customDimensions.index,
customDimensions.value,
hits.appInfo.screenName,
UserIdd,
count(hits.appInfo.screenName)
FROM (FLATTEN(( FLATTEN((
SELECT
date,
hits.appInfo.version,
customDimensions.value,
customDimensions.index,
hits.appInfo.screenName,
max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))), customDimensions.value)),hits.type))
WHERE
customDimensions.value = 'Android'
and customDimensions.index = 20
and UserIdd is not null
GROUP BY
1,
2,
3,
4,
5,
6
ORDER BY
7 DESC
I'm not positive that hits.customDimensions.* will always have the user-scoped dimensions (and I'm guessing your userId dimension is user-scoped).
Specifically, user-scoped dimensions should be queried from customDimensions, not hits.customDimensions.
Notionally, the first step is to make customDimensions compatible with hits.* via flattening or scoped aggregation. I'll explain the flattening approach.
GA records have the shape (customDimensions[], hits[], ...), which is no good for querying both fields. We begin by flattening these to (customDimensionN, hits[], ...).
One level up, by selecting fields under hits.*, we implicitly flatten the table into (customDimensionN, hitN) records. We filter these to include only the records matching (customDimension5, appviewN).
The last step is to count everything up.
SELECT date, v, sn, uid, COUNT(*)
FROM (
SELECT
date,
hits.appInfo.version v,
hits.appInfo.screenName sn,
customDimensions.value uid
FROM
FLATTEN((
SELECT customDimensions.*, hits.*, date
FROM
TABLE_DATE_RANGE(
[fiery-cabinet-97820:87025718.ga_sessions_],
TIMESTAMP('2017-04-04'),
TIMESTAMP('2017-04-04'))),
customDimensions)
WHERE hits.type = "APPVIEW" and customDimensions.index = 5)
GROUP BY 1,2,3,4
ORDER BY 5 DESC
Here's another equivalent approach. This uses the scoped aggregation trick that I've seen recommended in the GA BQ cookbook. Looking at the query explanation, however, the MAX(IF(...)) WITHIN RECORD seems to be quite expensive, triggering an extra COMPUTE and AGGREGATE phase in the first stage. Bonus points for being a bit more digestible, though.
SELECT sn, uid, date, v, COUNT(*)
FROM (
SELECT
MAX(IF(customDimensions.index = 5, customDimensions.value, null)) within record as uid,
hits.appInfo.screenname as sn,
date,
hits.appInfo.version as v,
hits.type
FROM
TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04')))
WHERE hits.type = "APPVIEW" and uid is not null
GROUP BY 1,2,3,4
ORDER BY 5 DESC
I'm not yet familiar with the Standard SQL dialect of BQ, but it seems that it would simplify this kind of wrangling. You might want to wrap your head around that if you'll be making many queries like this.
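For intuition, here is a rough Python sketch of both approaches on made-up records (the data and field names are hypothetical and simplified). It shows that the two agree, and that the unfiltered cross product of customDimensions and hits is larger than the real hit count, which is one plausible source of inflated numbers.

```python
from collections import Counter

# Hypothetical records shaped like (customDimensions[], hits[], date).
records = [
    {
        "date": "20170404",
        "customDimensions": [{"index": 5, "value": "user-1"}, {"index": 20, "value": "Android"}],
        "hits": [
            {"type": "APPVIEW", "version": "1.0", "screenName": "Home"},
            {"type": "APPVIEW", "version": "1.0", "screenName": "Home"},
            {"type": "EVENT", "version": "1.0", "screenName": "Home"},
        ],
    },
    {
        "date": "20170404",
        "customDimensions": [{"index": 20, "value": "iOS"}],  # no index 5
        "hits": [{"type": "APPVIEW", "version": "1.1", "screenName": "Settings"}],
    },
]

# Approach 1: flatten to (customDimension, hit) pairs, then filter.
flattened = Counter()
for r in records:
    for cd in r["customDimensions"]:
        for h in r["hits"]:
            if h["type"] == "APPVIEW" and cd["index"] == 5:
                flattened[(r["date"], h["version"], h["screenName"], cd["value"])] += 1

# Approach 2: scoped aggregation, like MAX(IF(index = 5, value, NULL)) WITHIN RECORD.
scoped = Counter()
for r in records:
    values = [cd["value"] for cd in r["customDimensions"] if cd["index"] == 5]
    uid = max(values) if values else None
    for h in r["hits"]:
        if h["type"] == "APPVIEW" and uid is not None:
            scoped[(r["date"], h["version"], h["screenName"], uid)] += 1

# Without the index filter, every hit is counted once per custom dimension.
unfiltered = sum(len(r["customDimensions"]) * len(r["hits"]) for r in records)
total_hits = sum(len(r["hits"]) for r in records)

print(flattened == scoped, unfiltered, total_hits)
```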

BigQuery: two hitlevel custom dimensions

I can't seem to get a query that gives me all sessions in which customdimensionX has value X and customdimensionY has value Y within the same hit. The query I currently have results in no results found.
Can anybody help me on this:)?
Thanks!
SELECT sum(totals.visits)
from TABLE_DATE_RANGE([xxxx.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
WHERE
(hits.customDimensions.index=x AND hits.customDimensions.value='x')
AND (hits.customDimensions.index=y AND hits.customDimensions.value='y')
Bit strange to answer my own question but it might be useful for someone else:) I got to the right number in the following way:
SELECT EXACT_COUNT_DISTINCT(uniqueVisitId) as sessions
FROM(
SELECT
CONCAT(fullvisitorid,"_",string(visitId)) AS uniqueVisitId,
MAX(IF(hits.customDimensions.index=x,hits.customDimensions.value,NULL)) WITHIN hits AS x,
MAX(IF(hits.customDimensions.index=y,hits.customDimensions.value,NULL)) WITHIN hits AS y,
hits.hitNumber
FROM TABLE_DATE_RANGE([xxxxxx.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
having
(x contains 'x' and y contains 'y')
)
Try the options below (I haven't had a chance to test them, but they should be close to what you need, if not exact):
SELECT SUM(totals.visits)
FROM TABLE_DATE_RANGE([66080915.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
OMIT RECORD IF
SUM((hits.customDimensions.index=x AND hits.customDimensions.value='x')
OR (hits.customDimensions.index=y AND hits.customDimensions.value='y')
) != 2
SELECT SUM(totals.visits) FROM (
SELECT totals.visits,
SUM((hits.customDimensions.index=x AND hits.customDimensions.value='x')
OR (hits.customDimensions.index=y AND hits.customDimensions.value='y')
) WITHIN RECORD AS check,
FROM TABLE_DATE_RANGE([66080915.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
HAVING check = 2
)
ADDED
If customDimensions were grouped by specific hits, like hits.hit.customVariables, you would be able to identify both conditions within the same hit by using
WITHIN hits.hit or OMIT hits.hit IF
vs. respectively
WITHIN RECORD or OMIT RECORD IF
But I've checked the BigQuery Export schema and that does not seem to be the case.
I don't see a way to distinguish dimensions per specific hit.
Custom dimensions are presented by level: user/session level, product level, and hit level.
Only product-level custom dimensions can be identified/queried per product.
Hope this helps
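As a plain-Python illustration of the "both values within the same hit" condition, here is a sketch on made-up sessions. The indices 3 and 7, the values, and the helper names are hypothetical placeholders for x and y. It also shows why the original WHERE clause returns nothing: a single custom dimension entry cannot have two different indices at once.

```python
def hit_matches(hit, wanted):
    """True if this single hit carries every (index, value) pair in `wanted`."""
    present = {(cd["index"], cd["value"]) for cd in hit["customDimensions"]}
    return wanted <= present


def session_matches(session, wanted):
    """True if at least one hit in the session satisfies all conditions together."""
    return any(hit_matches(hit, wanted) for hit in session["hits"])


# Hypothetical data: session 1 has both values on one hit, session 2 on separate hits.
sessions = [
    {"hits": [{"customDimensions": [{"index": 3, "value": "x"}, {"index": 7, "value": "y"}]}]},
    {"hits": [
        {"customDimensions": [{"index": 3, "value": "x"}]},
        {"customDimensions": [{"index": 7, "value": "y"}]},
    ]},
]

wanted = {(3, "x"), (7, "y")}
matching = sum(1 for s in sessions if session_matches(s, wanted))

# The original WHERE asks one entry to have index 3 AND index 7, which never holds.
naive = sum(
    1
    for s in sessions
    for hit in s["hits"]
    for cd in hit["customDimensions"]
    if cd["index"] == 3 and cd["index"] == 7
)

print(matching, naive)
```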

Difference in statistics from Google Analytics Report and BigQuery Data in Hive table

I have a Google Analytics premium account set up to monitor the user activity of a website and mobile application.
Raw data from GA is being stored in BigQuery tables.
However, I noticed that the statistics that I see in a GA report are quite different from the statistics that I see when querying the BigQuery tables.
I understand that GA reports show aggregated data and possibly, sampled data. And that the raw data in Bigquery tables is session/hit-level data.
But I am still not sure if I understand the reason why the statistics could be different.
Would really appreciate it if someone clarified this for me.
Thanks in advance.
UPDATE 1:
I exported the raw data from Bigquery into my Hadoop cluster. The data is stored in a hive table. I flattened all the nested and repeated fields before exporting.
Here is the hive query that I ran on the raw data in the Hive table:
SELECT
date as VisitDate,
count(distinct fullvisitorid) as CountVisitors,
SUM(totals_visits) as SumVisits,
SUM(totals_pageviews) AS PVs
FROM
bigquerydata
WHERE
fullvisitorid IS NOT NULL
GROUP BY
date
ORDER BY
VisitDate DESC
A) Taking February 9th as the VisitDate, I get the following results from this query:
i) CountVisitors= 1,074,323
ii) SumVisits= 48,990,198
iii) PVs= 1,122,841,424
Vs
B) Taking the same VisitDate and obtaining the same statistics from the GA report:
i) Users count = 1,549,757
ii) Number of pageviews = 11,604,449 (Huge difference when compared to A(iii))
In the hive query above, am I using any wrong fields or processing the fields in a wrong way? Just trying to figure out why I have this difference in numbers.
UPDATE 2 (following @Felipe Hoffa's suggestion):
This is how I am flattening the tables in my Python code before exporting the result to GCS and then to Hadoop cluster:
queryString = 'SELECT * FROM flatten(flatten(flatten(flatten(flatten(flatten([' + TABLE_NAME + '],hits),hits.product),hits.promotion),hits.customVariables), hits.customDimensions), hits.customMetrics)'
I understand what you are saying about flattening causing repeated pageviews and each repetition getting into the final wrong addition.
I tried the same query (from Update1) on Bigquery table instead of my Hive table. The numbers matched with those on the Google Analytics Dashboard.
However, assuming that the Hive table is all I have and that it contains those repeated values due to flattening, is there still any way I can fix my Hive query to match the stats from the Google Analytics dashboard?
Logically speaking, if the repeated fields came up due to flattening.. can't I reverse the same thing in my Hive table? If you think that I can reverse, do you have any suggestion as to how I can proceed on it?
Thank you so much in advance!
Can you run the same query in BigQuery, instead of on the data exported to Hive?
My guess: "The data is stored in a hive table. I flattened all the nested and repeated fields before exporting." When flattening - are you repeating pageviews several times, with each repetition getting into the final wrong addition?
Note how data can get duplicated when flattening rows:
SELECT col, x FROM (
SELECT "wrong" col, SUM(totals.pageviews) x
FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
), (
SELECT "correct" col, SUM(totals.pageviews) x
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
col x
wrong 2262
correct 249
Update given "update 2" to the question:
Since BigQuery is working correctly, and this is a Hive problem, you should add that tag to get relevant answers.
Nevertheless, this is how I would correctly de-duplicate previously duplicated rows with BigQuery:
SELECT SUM(pv)
FROM (
SELECT visitId, MAX(totals.pageviews) pv
FROM (FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits))
GROUP EACH BY 1
)
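The same duplication and its fix can be reproduced on toy data in Python. The sessions below are made up, and keying on visitId alone follows the query above; in real data a key that also includes fullVisitorId is likely safer, since visitId is not unique across visitors.

```python
# Hypothetical sessions: pageviews is a session-level total, hits is repeated.
sessions = [
    {"visitId": 1, "pageviews": 3, "hits": ["/a", "/b", "/c"]},
    {"visitId": 2, "pageviews": 2, "hits": ["/a", "/b"]},
]

# Flattening copies the session-level total onto every hit row.
flat_rows = [
    {"visitId": s["visitId"], "pageviews": s["pageviews"], "hit": h}
    for s in sessions
    for h in s["hits"]
]

# Summing over flattened rows counts each total once per hit: 3*3 + 2*2.
wrong = sum(r["pageviews"] for r in flat_rows)

# De-duplicate: take one value (MAX) per visit, then sum.
per_visit = {}
for r in flat_rows:
    per_visit[r["visitId"]] = max(per_visit.get(r["visitId"], 0), r["pageviews"])
deduplicated = sum(per_visit.values())

print(wrong, deduplicated)
```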

Sessions by hits.page.pagePath in GA bigquery tables

I am new to bigquery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of sessionids where path like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
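The per-session flags computed by SOME(...) WITHIN RECORD behave like `any()` over a session's hits. A small Python sketch with made-up sessions (the paths and the `classify` helper are hypothetical):

```python
from collections import Counter


def classify(paths):
    """Label a session by which path prefixes appear among its hits."""
    has_helmets = any(p.startswith("/helmets") for p in paths)
    has_vests = any(p.startswith("/vests") for p in paths)
    if has_helmets and has_vests:
        return "both_helmets_and_vests"
    if has_helmets:
        return "helmets_only"
    if has_vests:
        return "vests_only"
    return "neither_helmets_nor_vests"


# Hypothetical sessions, each a list of hit page paths.
sessions = [
    ["/helmets/road", "/vests/orange"],
    ["/helmets/kids"],
    ["/vests/yellow", "/basket"],
    ["/home"],
]

counts = Counter(classify(paths) for paths in sessions)
print(counts)
```

Because each session gets exactly one label, the four counts add up to the total number of sessions.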
Way 1, easier but you need to repeat on each field
Obviously you can do something like this :
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then have multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc).
Way 2, works fine, but not with repeated fields
If you want ONE query that'll just group by on the first part of the string, you can do something like that :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I do this, it returns the 63 sessions, with a total count of 63 :).
Way 3, using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeatable, you would need a FLATTEN in your query :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason why you need to FLATTEN here is that the "hits" field is repeatable. If you don't flatten, it won't look into ALL the "hits" in your response. Adding "FLATTEN" will make you work off a sub-table where each hit is in its own row, so you can query on all of them.
If you want it by sessions instead of hits, (it'll be both), do something like :
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b, FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a
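To show the difference between counting hits and counting sessions per path section, here is a Python sketch on made-up data. The `first_segment` helper is hypothetical and skips the empty string before the leading slash, which may differ from how SPLIT treats such paths.

```python
from collections import Counter, defaultdict


def first_segment(path):
    """Return the first non-empty path segment, or '(root)' for '/'."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else "(root)"


# Hypothetical sessions with their hit page paths.
sessions = [
    {"visitId": 1, "paths": ["/helmets/road", "/helmets/kids", "/vests/orange"]},
    {"visitId": 2, "paths": ["/helmets/road", "/"]},
]

hits_per_segment = Counter()
visits_by_segment = defaultdict(set)
for s in sessions:
    for path in s["paths"]:
        segment = first_segment(path)
        hits_per_segment[segment] += 1                # one count per hit
        visits_by_segment[segment].add(s["visitId"])  # one entry per session

sessions_per_segment = {seg: len(ids) for seg, ids in visits_by_segment.items()}
print(dict(hits_per_segment), sessions_per_segment)
```

A session with two helmet pages contributes two hits but only one session to the helmets row, so the session counts can sum to more than the number of sessions but never exceed the hit counts.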
