BigQuery: unexpected result after filtering table - google-analytics

Given the following query (very simplified):
SELECT hits.page.pagepath AS Page
FROM
`[projectid].[datasetid].ga_sessions_*` t, t.hits as hits
WHERE
_TABLE_SUFFIX BETWEEN '20190123' AND '20190123'
AND (SELECT COUNT(*)>0 FROM t.hits WHERE REGEXP_CONTAINS(hits.page.pagepath,r'dames'))
I expected this query to return only pages that contain 'dames', but that is not the case. With this filter in the WHERE clause..
(SELECT COUNT(*)>0 FROM t.hits WHERE REGEXP_CONTAINS(hits.page.pagepath,r'dames'))
..the data is flattened on hit level and filtered to pages containing 'dames'. The main query is also flattened on hit level. So I would expect a TRUE or FALSE per hit, with only the TRUE rows remaining in the final dataset, namely only pages that contain 'dames'.
I know queries that do return the expected output, so my main question (purely to understand it) is: why does this query not work as expected?
Thanks in advance!

Cross-joining an unnested array with its parent row does not exactly flatten the source table. It repeats the parent row for every element of the array: in this case all session information gets repeated for every hit, including the hits array itself!
That means for every hit you can look things up in the whole session, because every row still carries all of the session's hits.
You are accessing this repeated hits array in your WHERE clause.
Instead of writing a sub-select on this repeated array, you want to use the newly available cross-joined fields from that array, i.e. AND REGEXP_CONTAINS(hits.page.pagepath,r'dames')
It might be a bit confusing in your case, because your alias for the flattened hits is hits as well. Consider renaming it to something different, like h, so your NOT working query looks like this:
SELECT h.page.pagepath AS Page
FROM
`[projectid].[datasetid].ga_sessions_*` t, t.hits as h
WHERE
_TABLE_SUFFIX BETWEEN '20190123' AND '20190123'
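AND (SELECT COUNT(*)>0 FROM t.hits h2 WHERE REGEXP_CONTAINS(h2.page.pagepath,r'dames'))
Inside the sub-select, the name hits in your original query resolves to the sub-select's own t.hits (h2 here), not to the cross-joined alias. So you are checking for every page whether the whole session contained a page fulfilling your condition.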
The WORKING example would be
SELECT h.page.pagepath AS Page
FROM
`[projectid].[datasetid].ga_sessions_*` t, t.hits as h
WHERE
_TABLE_SUFFIX BETWEEN '20190123' AND '20190123'
AND REGEXP_CONTAINS(h.page.pagepath,r'dames')
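To make the difference concrete, here is a rough model of the two filters in plain Python. It is only a sketch: the session data and the cross_join helper are made up for illustration, and a substring test stands in for REGEXP_CONTAINS.

```python
# Hypothetical sessions; each carries a repeated "hits" field, as in the GA export.
sessions = [
    {"id": "s1", "hits": ["/home", "/dames/schoenen", "/cart"]},
    {"id": "s2", "hits": ["/home", "/heren"]},
]


def cross_join(rows):
    # Mimics `FROM table t, t.hits AS h`: one output row per hit,
    # each row still carrying the full parent session (including all its hits).
    return [(t, h) for t in rows for h in t["hits"]]


# Sub-select over t.hits: true for every row of a session that has any match.
session_level = [
    h for t, h in cross_join(sessions)
    if any("dames" in h2 for h2 in t["hits"])
]

# Filter on the cross-joined field h: true only for the matching hits.
hit_level = [h for t, h in cross_join(sessions) if "dames" in h]

print(session_level)  # every page of sessions that saw a 'dames' page
print(hit_level)      # only the 'dames' pages
```

The first list keeps all pages of session s1, which is the behaviour seen in the question; the second keeps only the matching page.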

Related

Big Query and Google Analytics UI do not match when ecommerce action filter applied

We are validating a query in Big Query, and cannot get the results to match with the Google Analytics UI. A similar question can be found here, but in our case the mismatch only occurs when we apply a specific filter on ecommerce_action.action_type.
Here is the query:
SELECT COUNT(distinct fullVisitorId+cast(visitid as string)) AS sessions
FROM (
SELECT
device.browserVersion,
geoNetwork.networkLocation,
geoNetwork.networkDomain,
geoNetwork.city,
geoNetwork.country,
geoNetwork.continent,
geoNetwork.region,
device.browserSize,
visitNumber,
trafficSource.source,
trafficSource.medium,
fullvisitorId,
visitId,
device.screenResolution,
device.flashVersion,
device.operatingSystem,
device.browser,
totals.pageviews,
channelGrouping,
totals.transactionRevenue,
totals.timeOnSite,
totals.newVisits,
totals.visits,
date,
hits.eCommerceAction.action_type
FROM
(select *
from TABLE_DATE_RANGE([zzzzzzzzz.ga_sessions_],
<range>) ))t
WHERE
hits.eCommerceAction.action_type = '2' and <stuff to remove bots>
)
From the UI using the built in shopping behavior report, we get 3.836M unique sessions with a product detail view, compared with 3.684M unique sessions in Big Query using the query above.
A few questions:
1) We are under the impression that the shopping behavior report's "Sessions with Product View" breakdown is based on the ecommerce_action.action_type filter. Is that true?
2) Is there a .totals pre-aggregated table that the UI may be pulling from?
It sounds like the issue is that COUNT(DISTINCT ...) is approximate when using legacy SQL, as noted in the migration guide, so the counts are not accurate. Either use standard SQL instead (preferred) or use EXACT_COUNT_DISTINCT with legacy SQL.
You're including product list views in your query.
As described in https://support.google.com/analytics/answer/3437719 you need to make sure that no product has isImpression = TRUE, because that would mean it is a product list view.
This query sums all sessions which contain any hit with action_type='2' for which every product's isImpression is null or false:
SELECT
SUM(totals.visits) AS sessions
FROM
`project.123456789.ga_sessions_20180101` AS t
WHERE
(
SELECT
LOGICAL_OR(h.ecommerceaction.action_type='2')
FROM
t.hits AS h
WHERE
(SELECT LOGICAL_AND(isimpression IS NULL OR isimpression = FALSE) FROM h.product))
For legacySQL you can adapt the example in the documentation.
In addition to the fact that COUNT(DISTINCT ...) is approximate when using legacy SQL, there could be sessions that contain only non-interactive hits. These are not counted as sessions in the Google Analytics UI, but they are counted by both COUNT(DISTINCT ...) and EXACT_COUNT_DISTINCT(...), because your query counts visit IDs.
Using SUM(totals.visits) you should get the same result as in the UI because SUM does not take into account NULL values of totals.visits (corresponding to sessions in which there are only non-interactive hits).
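A small Python model of that difference, as a sketch only: the three sessions are invented, and None stands in for a NULL totals.visits.

```python
# Hypothetical session rows; visits=None models a session with only
# non-interactive hits (totals.visits is NULL in the export).
sessions = [
    {"fullVisitorId": "a", "visitId": 1, "visits": 1},
    {"fullVisitorId": "a", "visitId": 2, "visits": None},
    {"fullVisitorId": "b", "visitId": 1, "visits": 1},
]

# Counting distinct (fullVisitorId, visitId) pairs includes the NULL session.
distinct_ids = len({(s["fullVisitorId"], s["visitId"]) for s in sessions})

# SUM skips NULLs, so the non-interactive session is not counted.
summed_visits = sum(s["visits"] for s in sessions if s["visits"] is not None)

print(distinct_ids, summed_visits)
```

Here the distinct count is one higher than the sum, which is the direction of mismatch this answer describes.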

BigQuery: two hitlevel custom dimensions

I can't seem to get a query that gives me all sessions in which customdimensionX has value X and customdimensionY has value Y within the same hit. The query I currently have results in no results found.
Can anybody help me on this:)?
Thanks!
SELECT sum(totals.visits)
from TABLE_DATE_RANGE([xxxx.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
WHERE
(hits.customDimensions.index=x AND hits.customDimensions.value='x')
AND (hits.customDimensions.index=y AND hits.customDimensions.value='y')
Bit strange to answer my own question but it might be useful for someone else:) I got to the right number in the following way:
SELECT EXACT_COUNT_DISTINCT(uniqueVisitId) as sessions
FROM(
SELECT
CONCAT(fullvisitorid,"_",string(visitId)) AS uniqueVisitId,
MAX(IF(hits.customDimensions.index=x,hits.customDimensions.value,NULL)) WITHIN hits AS x,
MAX(IF(hits.customDimensions.index=y,hits.customDimensions.value,NULL)) WITHIN hits AS y,
hits.hitNumber
FROM TABLE_DATE_RANGE([xxxxxx.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
having
(x contains 'x' and y contains 'y')
)
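For anyone wondering why the query in the question returns nothing: a single customDimensions entry cannot have index x and index y at the same time, so the combined WHERE condition is never true. The sketch below models this in Python with made-up data (indexes 1 and 2 and the values are placeholders) and contrasts it with a per-hit check, which is the idea behind aggregating WITHIN hits.

```python
X_INDEX, Y_INDEX = 1, 2  # hypothetical custom dimension indexes

# Hypothetical sessions: s1 has both dimensions on the same hit,
# s2 has them on two different hits.
sessions = [
    {"id": "s1", "hits": [
        {"customDimensions": [{"index": 1, "value": "x"},
                              {"index": 2, "value": "y"}]},
    ]},
    {"id": "s2", "hits": [
        {"customDimensions": [{"index": 1, "value": "x"}]},
        {"customDimensions": [{"index": 2, "value": "y"}]},
    ]},
]


def entry_matches_both(cd):
    # The question's WHERE clause: both conditions on one and the same entry.
    return (cd["index"] == X_INDEX and cd["value"] == "x") and \
           (cd["index"] == Y_INDEX and cd["value"] == "y")


def hit_has(hit, index, value):
    # True if any custom dimension on this hit has the given index and value.
    return any(cd["index"] == index and cd["value"] == value
               for cd in hit["customDimensions"])


naive = [s["id"] for s in sessions
         if any(entry_matches_both(cd)
                for hit in s["hits"] for cd in hit["customDimensions"])]

same_hit = [s["id"] for s in sessions
            if any(hit_has(h, X_INDEX, "x") and hit_has(h, Y_INDEX, "y")
                   for h in s["hits"])]

print(naive)     # no session can match
print(same_hit)  # only the session with both dimensions on one hit
```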
Try below options (don't have chance to test, but should be close to what you need, if not exactly):
SELECT SUM(totals.visits)
FROM TABLE_DATE_RANGE([66080915.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
OMIT RECORD IF
SUM((hits.customDimensions.index=x AND hits.customDimensions.value='x')
OR (hits.customDimensions.index=y AND hits.customDimensions.value='y')
) != 2
SELECT SUM(totals.visits) FROM (
SELECT totals.visits,
SUM((hits.customDimensions.index=x AND hits.customDimensions.value='x')
OR (hits.customDimensions.index=y AND hits.customDimensions.value='y')
) WITHIN RECORD AS check,
FROM TABLE_DATE_RANGE([66080915.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
HAVING check = 2
)
ADDED
If customDimensions were grouped by specific hits, like hits.hit.customVariables, you would be able to identify both conditions within the same hit by using
WITHIN hits.hit or OMIT hits.hit IF
vs. respectively
WITHIN RECORD or OMIT RECORD IF
But I've checked the BigQuery Export schema and that does not seem to be the case.
I don't see a way to distinguish dimensions per specific hit.
Custom Dimensions are presented by level: user/session level, product level and hit level.
Only product-level custom dimensions can be identified/queried per product.
Hope this helps

Sessions by hits.page.pagePath in GA bigquery tables

I am new to bigquery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of sessionids where path like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!
Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
visitID,
has_helmets AND has_vests AS both_helmets_and_vests,
has_helmets AND NOT has_vests AS helmets_only,
NOT has_helmets AND has_vests AS vests_only,
NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
SELECT
visitId,
SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
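If it helps to see the logic outside of SQL, the per-session flags can be modelled in Python roughly like this. The visit IDs and paths are invented, and any() plays the role of SOME(...) WITHIN RECORD.

```python
# Hypothetical sessions: visitId -> page paths of its hits.
sessions = {
    101: ["/helmets/road", "/vests/yellow"],
    102: ["/helmets/kids"],
    103: ["/vests/orange", "/"],
    104: ["/about"],
}


def classify(paths):
    # any() over the hits of one session mirrors SOME(...) WITHIN RECORD.
    has_helmets = any(p.startswith("/helmets") for p in paths)
    has_vests = any(p.startswith("/vests") for p in paths)
    if has_helmets and has_vests:
        return "both_helmets_and_vests"
    if has_helmets:
        return "helmets_only"
    if has_vests:
        return "vests_only"
    return "neither_helmets_nor_vests"


result = {visit_id: classify(paths) for visit_id, paths in sessions.items()}
print(result)
```

Each session lands in exactly one bucket, so counting per bucket adds up to the total number of sessions.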
Way 1, easier but you need to repeat on each field
Obviously you can do something like this :
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then have multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc).
Way 2, works fine, but not with repeated fields
If you want ONE query that'll just group by on the first part of the string, you can do something like that :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I do this, it returns the 63 sessions, with a total count of 63 :).
Way 3, using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeatable, you would need a FLATTEN in your query :
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason why you need to FLATTEN here is that the "hits" field is repeatable. If you don't flatten, it won't look into ALL the "hits" in your response. Adding "FLATTEN" will make you work off a sub-table where each hit is in its own row, so you can query on all of them.
If you want it by session as well as by hit (it'll be both), do something like :
Select b, a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a, visitID as b, FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by b, a

Poor SP performance from ASP.NET

I have a stored procedure that handles sorting, filtering and paging (using Row_Number) and some funky trickery :) The SP is running against a table with ~140k rows.
The whole thing works great and for at least the first few dozen pages is super quick.
However, if I try to navigate to higher pages (e.g. head to the last page of 10k) the whole thing comes to a grinding halt and results in a SQL timeout error.
If I run the same query, using the same params, inside the Management Studio query window, the response is instant irrespective of the page number I pass in.
At the moment it's test code that is simply binding to an asp:DataGrid in .NET 3.5
The SP looks like this:
BEGIN
WITH Keys
AS (
SELECT
TOP (@PageNumber * @PageSize) ROW_NUMBER() OVER (ORDER BY JobNumber DESC) as rn
,P1.jobNumber
,P1.CustID
,P1.DateIn
,P1.DateDue
,P1.DateOut
FROM vw_Jobs_List P1
WHERE
(@CustomerID = 0 OR CustID = @CustomerID) AND
(JobNumber LIKE '%'+@FilterExpression+'%'
OR OrderNumber LIKE '%'+@FilterExpression+'%'
OR [Description] LIKE '%'+@FilterExpression+'%'
OR Client LIKE '%'+@FilterExpression+'%')
ORDER BY P1.JobNumber DESC ),SelectedKeys
AS (
SELECT
TOP (@PageSize) SK.rn
,SK.JobNumber
,SK.CustID
,SK.DateIn
,SK.DateDue
,SK.DateOut
FROM Keys SK
WHERE SK.rn > ((@PageNumber-1) * @PageSize)
ORDER BY SK.JobNumber DESC)
SELECT
SK.rn
,J.JobNumber
,J.Description
,J.Client
,SK.CustID
,OrderNumber
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateIn,0) AS DateTime)) AS nvarchar) AS DateIn
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateDue,0) AS DateTime)) AS nvarchar) AS DateDue
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateOut,0) AS DateTime)) AS nvarchar) AS DateOut
,Del_Method
,Ticket#
,InvoiceEmailed
,InvoicePrinted
,InvoiceExported
,InvoiceComplete
,JobStatus
FROM SelectedKeys SK
JOIN vw_Jobs_List J ON j.JobNumber=SK.JobNumber
ORDER BY SK.JobNumber DESC
END
And it's called via
sp_jobs (PageNumber,PageSize,FilterExpression,OrderBy,CustomerID)
e.g.
sp_Jobs '13702','10','','JobNumberDESC','0'
Can anyone shed any light on what might be the cause of the dramatic difference in performance between SQL query window and an asp.net page executing a dataset?
Check out the "WITH RECOMPILE" option
http://www.techrepublic.com/article/understanding-sql-servers-with-recompile-option/5662581
I have run into similar problems where the execution plan on stored procedures will work great for a while, but then get a new plan because the options changed. So, it will be "optimized" for one case and then perform "table scans" for another option. Here is what I have tried in the past:
1) Re-execute the stored procedure to calculate a new execution plan and then keep an eye on it.
2) Break up the stored procedure into separate stored procedures, one per option, so that each can be optimized, and have the overall stored procedure simply call each "optimized" stored procedure.
3) Bring the records into an object and perform all of the "funky trickery" in code, which also gives you the option to "cache" the results.
Obviously options #2 and #3 are better than option #1. I am honestly finding option #3 is becoming the best bet in most cases.
I just thought of an option 4: instead of performing your "inner selects" in one query, you could put the results of your inner selects into temporary tables and then JOIN on those results. I would still push for option #3 if possible, but I understand that sometimes you just need to keep working on the stored procedure until it "works".
Good luck.

ASP.NET, SQL 2005 "paging"

This is a followup on the question:
ASP.NET next/previous buttons to display single row in a form
As it says on the page above, there's a previous/next button on the page that retrieves a single row at a time.
In total there are ~500,000 rows.
When I "page" through each subscription number, the form gets filled with subscriber details. What approach should I use on the SQL server?
Using the ROW_NUMBER() function seems a bit overkill as it has to number all ~500,000 rows (I guess?), so what other possible solutions are there?
Thanks in advance!
ROW_NUMBER() is probably your best choice.
From this MSDN article: http://msdn.microsoft.com/en-us/library/ms186734.aspx
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS 'RowNumber'
FROM Sales.SalesOrderHeader
)
SELECT *
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
And just substitute 50 and 60 with parameters for the row numbers you want.
Tommy, if your user has time to page through 500,000 rows at one page per row, then he/she is unique.
I guess what I am saying here is that you may be able to provide a better UX. Too many pages? Build a search feature.
There are two potential workarounds (for this purpose, using a start of 201, pages of 100):
SQL
SELECT TOP 100 * FROM MyTable WHERE ID > 200 ORDER BY ID
LINQ to SQL
var MyRows = (from t in db.Table
orderby t.ID ascending
select t).Skip(200).Take(100);
If your ID field has a clustered index, use the former. If not, both of these will take the same amount of time (LINQ returns 500,000 rows, then skips, then takes).
If you're sorting by something that's NOT ID and you have it indexed, use ROW_NUMBER().
Edit: Because the OP isn't sorting by ID, the only solution is ROW_NUMBER(), which is the clause that I put at the end there.
In this case, the table isn't indexed, so please see here for ideas on how to index to improve query performance.
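One caveat about the WHERE ID > 200 approach above: it only lines up with "skip 200 rows" when the IDs are contiguous and start at 1. With gaps, you would page from the last ID actually seen on the previous page instead. A minimal Python sketch of the two strategies (the ID list and function names are made up for illustration):

```python
# Hypothetical table IDs with gaps (e.g. after deletes).
ids = [2, 4, 6, 8, 10, 12]
page_size = 2


def offset_page(all_ids, skip, size):
    # Models Skip(skip).Take(size), or a ROW_NUMBER() range.
    return sorted(all_ids)[skip:skip + size]


def keyset_page(all_ids, last_seen_id, size):
    # Models SELECT TOP size ... WHERE ID > last_seen_id ORDER BY ID.
    return [i for i in sorted(all_ids) if i > last_seen_id][:size]


first_page = offset_page(ids, 0, page_size)
second_by_offset = offset_page(ids, page_size, page_size)

# Using the row count as the ID boundary goes wrong when IDs have gaps...
second_by_count_as_id = keyset_page(ids, page_size, page_size)

# ...while using the last ID of the previous page gives the right rows.
second_by_last_id = keyset_page(ids, first_page[-1], page_size)

print(second_by_offset, second_by_count_as_id, second_by_last_id)
```

For a previous/next form this fits naturally, since the page already knows the ID of the row it is currently showing.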

Resources