Sessions by hits.page.pagePath in GA bigquery tables - google-analytics

I am new to BigQuery, so sorry if this is a noob question! I am interested in breaking out sessions by page path or title. I understand one session can contain multiple paths/titles, so the sum would be greater than total sessions. Essentially, I want to create a 'session id' and do a count distinct of session ids where the path is like a or b.
It might actually be helpful to start at the very beginning and manually calculate total sessions. I tried to concatenate visit id and full visitor id to create a unique visit id, but apparently that is quite different from sessions. Can someone help enlighten me? Thanks!
I am working with our GA site data. Schema is the standard in GA exports.
DATA SAMPLE
Let's use an example out of the sample BigQuery (London Helmet) data:
There are 63 sessions in this day:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
How many of those sessions are where hits.page.pagePath like /vests% or /helmets%? How many were vests only vs helmets only? Thanks!

Here is an example of how to calculate whether there were only helmets, or only vests or both helmets and vests or neither:
SELECT
  visitID,
  has_helmets AND has_vests AS both_helmets_and_vests,
  has_helmets AND NOT has_vests AS helmets_only,
  NOT has_helmets AND has_vests AS vests_only,
  NOT has_helmets AND NOT has_vests AS neither_helmets_nor_vests
FROM (
  SELECT
    visitId,
    SOME(hits.page.pagePath like '/helmets%') WITHIN RECORD AS has_helmets,
    SOME(hits.page.pagePath like '/vests%') WITHIN RECORD AS has_vests,
  FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
)
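As a rough illustration of what the SOME(...) WITHIN RECORD logic above computes for each session, here is a minimal Python sketch. The session data is made up for the example (it is not the London Helmet sample), and classify is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sessions: one entry per session, with the page paths of its hits.
sessions = [
    {"visitId": 1, "paths": ["/", "/helmets/road", "/vests/orange"]},
    {"visitId": 2, "paths": ["/helmets/kids"]},
    {"visitId": 3, "paths": ["/vests/yellow", "/basket"]},
    {"visitId": 4, "paths": ["/", "/login"]},
]


def classify(paths):
    """Mirror SOME(... LIKE '/helmets%') and SOME(... LIKE '/vests%') per session."""
    has_helmets = any(p.startswith("/helmets") for p in paths)
    has_vests = any(p.startswith("/vests") for p in paths)
    if has_helmets and has_vests:
        return "both_helmets_and_vests"
    if has_helmets:
        return "helmets_only"
    if has_vests:
        return "vests_only"
    return "neither_helmets_nor_vests"


# Each session lands in exactly one bucket, so the buckets sum to the session total.
counts = Counter(classify(s["paths"]) for s in sessions)
print(dict(counts))
```

Because every session falls into exactly one of the four buckets, the bucket counts add up to the total number of sessions, which is a handy sanity check against the plain count(*).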

Way 1, easier, but you need to repeat it for each path
Obviously you can do something like this:
SELECT count(*) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE hits.page.pagePath like '/helmets%'
And then have multiple queries for your own substrings (one with '/vests%', one with '/helmets%', etc).
Way 2, works fine, but not with repeated fields
If you want ONE query that just groups by the first part of the path, you can do something like this:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] ) group by a
When I run this, it returns the 63 sessions, with a total count of 63 :).
Way 3, using a FLATTEN on the table to get each hit individually
Since the "hits" field is repeated, you need a FLATTEN in your query:
Select a, Count(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) as a FROM FLATTEN ([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] , hits)) group by a
The reason you need to FLATTEN here is that the "hits" field is repeated. If you don't flatten, the query won't look into ALL the "hits" in each record. Adding FLATTEN makes you work off a sub-table where each hit is in its own row, so you can query all of them.
If you want it broken out by session as well as by path (the result is grouped by both), do something like:
SELECT b, a, COUNT(*) FROM (SELECT FIRST(SPLIT(hits.page.pagePath, '/')) AS a, visitID AS b FROM FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits)) GROUP BY b, a
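To show the difference between counting hits and counting sessions once the hits are flattened, here is a small Python sketch. The data is invented, first_segment is a hypothetical helper, and the session id is assumed to be fullVisitorId joined with visitId, as the question suggests.

```python
from collections import defaultdict

# Hypothetical sessions; "paths" stands in for the repeated hits.page.pagePath field.
sessions = [
    {"fullVisitorId": "A", "visitId": 1,
     "paths": ["/helmets/road", "/helmets/kids", "/vests/orange"]},
    {"fullVisitorId": "A", "visitId": 2, "paths": ["/helmets/road"]},
    {"fullVisitorId": "B", "visitId": 1, "paths": ["/vests/yellow", "/vests/orange"]},
]


def first_segment(path):
    """Return the first non-empty path segment, e.g. '/helmets/road' -> 'helmets'."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else "(root)"


hit_counts = defaultdict(int)    # one count per flattened hit
session_ids = defaultdict(set)   # distinct session ids per segment

for s in sessions:
    session_id = f'{s["fullVisitorId"]}_{s["visitId"]}'
    for path in s["paths"]:      # "flatten": visit every hit of the session
        segment = first_segment(path)
        hit_counts[segment] += 1
        session_ids[segment].add(session_id)

session_counts = {seg: len(ids) for seg, ids in session_ids.items()}
print(dict(hit_counts))   # hits per segment
print(session_counts)     # distinct sessions per segment
```

In this toy data there are 3 sessions, yet the per-segment session counts sum to 4, because one session touched both sections. That is the "sum greater than total sessions" effect the question anticipates.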

Related

getting page views by two custom dimensions of different granularity in bigquery

I'm trying to pull a report from bigquery where I can see pageviews segmented by day and couple of custom dimensions (one at hit level and the other at session level) with this query:
SELECT
date
,SUM(totals.pageviews) as PVs
,MAX(IF(hits.customDimensions.index = 11, hits.customDimensions.value,NULL)) AS x
,MAX(IF(customDimensions.index = 1, customDimensions.value,NULL)) AS y
FROM TABLE_DATE_RANGE([111111111.ga_sessions_]
,TIMESTAMP('2016-10-01')
,TIMESTAMP('2016-10-31'))
GROUP EACH BY 1
I get the following:
Error: Cannot query the cross product of repeated fields customDimensions.index and hits.page.pagePath.
I've been looking at other answers but didn't find anything addressing a similar enough issue. Could you suggest a better query?
Thanks!
You need to flatten your data.
Take a look at Google's example for the error "Cannot query the cross product of repeated fields children.age and citiesLived.yearsLived" in "Dealing with data":
"To query across more than one repeated field, you need to flatten one of the fields:
SELECT
fullName,
age,
gender,
citiesLived.place
FROM (FLATTEN([dataset.tableId], children))
WHERE
(citiesLived.yearsLived > 1995) AND
(children.age > 3)
GROUP BY fullName, age, gender, citiesLived.place"
To get around the TABLE_DATE_RANGE limitation, try creating a sub-select first:
SELECT
  hits.eventInfo.eventCategory,
  hits.eventInfo.eventAction,
  customDimensions.value
FROM
  FLATTEN((
    SELECT
      hits.eventInfo.eventCategory,
      hits.eventInfo.eventAction,
      customDimensions.index,
      customDimensions.value
    FROM (TABLE_DATE_RANGE([dataset.table_],
                           DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY'),
                           DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY')))
  ), hits.eventInfo.eventCategory)
This is discussed on the official Google BigQuery issue and feature request tracker.

BigQuery: two hitlevel custom dimensions

I can't seem to get a query that gives me all sessions in which customdimensionX has value X and customdimensionY has value Y within the same hit. The query I currently have results in no results found.
Can anybody help me on this:)?
Thanks!
SELECT sum(totals.visits)
from TABLE_DATE_RANGE([xxxx.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
WHERE
(hits.customDimensions.index=x AND hits.customDimensions.value='x')
AND (hits.customDimensions.index=y AND hits.customDimensions.value='y')
Bit strange to answer my own question but it might be useful for someone else:) I got to the right number in the following way:
SELECT EXACT_COUNT_DISTINCT(uniqueVisitId) as sessions
FROM(
SELECT
CONCAT(fullvisitorid,"_",string(visitId)) AS uniqueVisitId,
MAX(IF(hits.customDimensions.index=x,hits.customDimensions.value,NULL)) WITHIN hits AS x,
MAX(IF(hits.customDimensions.index=y,hits.customDimensions.value,NULL)) WITHIN hits AS y,
hits.hitNumber
FROM TABLE_DATE_RANGE([xxxxxx.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
having
(x contains 'x' and y contains 'y')
)
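The distinction between "both values in the same hit" and "both values somewhere in the session" can be sketched in Python as follows. The nested layout (custom dimensions stored per hit) and the index numbers 1 and 2 are assumptions made for the illustration, not something taken from a real export.

```python
# Hypothetical sessions: each hit carries its own custom dimensions (index -> value).
sessions = [
    {"id": "A_1", "hits": [
        {"hitNumber": 1, "customDimensions": {1: "x", 2: "y"}},
    ]},
    {"id": "B_1", "hits": [
        {"hitNumber": 1, "customDimensions": {1: "x"}},
        {"hitNumber": 2, "customDimensions": {2: "y"}},
    ]},
]

X_INDEX, Y_INDEX = 1, 2  # assumed dimension indexes for the example


def matches_within_same_hit(session):
    """True if a single hit has dimension X == 'x' and dimension Y == 'y'."""
    return any(
        hit["customDimensions"].get(X_INDEX) == "x"
        and hit["customDimensions"].get(Y_INDEX) == "y"
        for hit in session["hits"]
    )


def matches_anywhere_in_session(session):
    """True if both values occur in the session, possibly on different hits."""
    has_x = any(h["customDimensions"].get(X_INDEX) == "x" for h in session["hits"])
    has_y = any(h["customDimensions"].get(Y_INDEX) == "y" for h in session["hits"])
    return has_x and has_y


same_hit = {s["id"] for s in sessions if matches_within_same_hit(s)}
any_hit = {s["id"] for s in sessions if matches_anywhere_in_session(s)}
print(same_hit, any_hit)
```

Session B_1 only qualifies under the looser session-level check, which is why the two approaches can return different session counts.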
Try the two options below (I haven't had a chance to test them, but they should be close to what you need, if not exact):
SELECT SUM(totals.visits)
FROM TABLE_DATE_RANGE([66080915.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
OMIT RECORD IF
SUM((hits.customDimensions.index=x AND hits.customDimensions.value='x')
OR (hits.customDimensions.index=y AND hits.customDimensions.value='y')
) != 2
SELECT SUM(totals.visits) FROM (
SELECT totals.visits,
SUM((hits.customDimensions.index=x AND hits.customDimensions.value='x')
OR (hits.customDimensions.index=y AND hits.customDimensions.value='y')
) WITHIN RECORD AS check,
FROM TABLE_DATE_RANGE([66080915.ga_sessions_], TIMESTAMP('2016-3-1'),TIMESTAMP('2016-3-1'))
HAVING check = 2
)
ADDED
If customDimensions were grouped under specific hits, like hits.hit.customVariables, you would be able to identify both conditions within the same hit by using
WITHIN hits.hit or OMIT hits.hit IF
instead of, respectively,
WITHIN RECORD or OMIT RECORD IF
But I've checked the BigQuery Export schema and that does not seem to be the case.
I don't see a way to distinguish dimensions per specific hit.
Custom dimensions are presented by level: user/session level, product level and hit level.
Only product-level custom dimensions can be identified/queried per product.
Hope this helps

SQLite: SELECT from grouped and ordered result

I'm new to SQL(ite), so I'm sorry if there is a simple answer and I was just too stupid to find the right search terms for it.
I have 2 tables: one for user information and another holding the points a user achieved. It's a simple one-to-many relation (a user can achieve points multiple times).
table1 contains "userID" and "Username" ...
table2 contains "userID" and "Amount" ...
Now I want to get a highscore rank for a given username.
To get the highscore I did:
SELECT Username, SUM(Amount) AS total FROM table2 JOIN table1 USING (userID) GROUP BY Username ORDER BY total DESC
How can I select a single Username and get its position in the grouped and ordered result? I have no idea what a subselect would have to look like for my goal. Is it even possible in a single query?
You cannot calculate the position of the user without referencing the other data. SQLite before version 3.25 has no ranking function, which would be ideal for your use case, nor a row-number feature that would serve as an acceptable substitute; both arrived with window functions in 3.25.
I suppose the closest you could get would be to drop this data into a temp table that has an incrementing ID, but I think you'd get very messy there.
It's best to handle this within the application. Get all the users and calculate rank. Cache individual user results as necessary.
Without knowing anything more about the operating context of the app/DB it's hard to provide a more specific recommendation.
For a specific user, this query gets the total amount:
SELECT SUM(Amount)
FROM Table2
WHERE userID = ?
To get the rank, count how many users have a total at least as high as that user's. The count includes the user itself, so the top scorer gets 1:
SELECT COUNT(*)
FROM table1
WHERE (SELECT SUM(Amount)
FROM Table2
WHERE userID = table1.userID)
>=
(SELECT SUM(Amount)
FROM Table2
WHERE userID = ?);
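A minimal runnable sketch of this counting approach, using Python's sqlite3 module with an in-memory database. The table and column names follow the question; the sample rows and the rank_of helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (userID INTEGER PRIMARY KEY, Username TEXT);
CREATE TABLE table2 (userID INTEGER, Amount INTEGER);
INSERT INTO table1 VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO table2 VALUES (1, 50), (1, 30), (2, 100), (3, 10), (3, 20);
""")

# Count the users whose total is at least as high as the given user's total.
RANK_SQL = """
SELECT COUNT(*)
FROM table1
WHERE (SELECT SUM(Amount) FROM table2 WHERE userID = table1.userID)
      >=
      (SELECT SUM(Amount) FROM table2 WHERE userID = ?)
"""


def rank_of(user_id):
    """Return the 1-based highscore position of the given user."""
    return conn.execute(RANK_SQL, (user_id,)).fetchone()[0]


users = conn.execute("SELECT userID, Username FROM table1 ORDER BY userID").fetchall()
ranks = {name: rank_of(user_id) for user_id, name in users}
print(ranks)
```

With totals of 80, 100 and 30, bob comes first, alice second and carol third. Note that with >= tied users all receive the lowest position of their group (two users tied for the top total would both get 2).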

How to create database table dynamically and insert data selected by query

I'm working on a website where I need to find the rank of a user on the basis of score. Until now I have been calculating the score and rank of the user with this SQL query:
select * from (
select
usrid,
ROW_NUMBER()
OVER(ORDER BY (count(*)+sum(sup)+sum(opp)+sum(visited)*0.3) DESC) AS rank,
(count(*)+sum(sup)+sum(opp)+sum(visited)*0.3 ) As score
from [DB_].[dbo].[dsas]
group by usrid) as cash
where usrid=#userid
Please don't concentrate too much on the query; it is only there to explain how I select the data.
Problem: I can't keep using the above query, because every time I need a rank it has to be computed from the dsas table, and the data in dsas is growing day by day and slowing down my website.
What I need is to select the data with the above query and insert it into another table named score. Can we do anything like this?
A better solution is to either include score as a field in your user table or have a separate table for scores. Any time you add new sup, opp, or visited data for a user, also recalculate their score at that time.
Then to get the highest ranking users, you will be able to perform a very simple select statement, ordering by score descending, and only fetching the number of rows you want. It will be very fast.
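One way to keep such a score table up to date is sketched below in Python with sqlite3, standing in for SQL Server. The dsas columns and the score formula (one point per row, plus sup, opp and 0.3 per visit) come from the question; the add_activity helper, the index name and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dsas (usrid INTEGER, sup INTEGER, opp INTEGER, visited INTEGER);
CREATE TABLE score (usrid INTEGER PRIMARY KEY, score REAL NOT NULL DEFAULT 0);
CREATE INDEX score_by_value ON score (score DESC);
""")


def add_activity(usrid, sup, opp, visited):
    """Store the raw row and bump the user's cached score in one transaction."""
    delta = 1 + sup + opp + visited * 0.3  # this row's contribution to the score
    with conn:
        conn.execute("INSERT INTO dsas VALUES (?, ?, ?, ?)", (usrid, sup, opp, visited))
        conn.execute("INSERT OR IGNORE INTO score (usrid) VALUES (?)", (usrid,))
        conn.execute("UPDATE score SET score = score + ? WHERE usrid = ?", (delta, usrid))


add_activity(1, 2, 1, 10)
add_activity(2, 0, 0, 0)
add_activity(1, 1, 0, 0)

# Reading the leaderboard no longer touches the growing dsas table.
top = conn.execute(
    "SELECT usrid, score FROM score ORDER BY score DESC LIMIT 10"
).fetchall()
print(top)
```

The cached value is updated incrementally here rather than recomputed from scratch, which keeps each write cheap; a periodic full recalculation from dsas could be added if drift is a concern.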

ASP.NET, SQL 2005 "paging"

This is a followup on the question:
ASP.NET next/previous buttons to display single row in a form
As it says on the page above, there's a previous/next button on the page that retrieves a single row at a time.
In total there are ~500,000 rows.
When I "page" through each subscription number, the form gets filled with subscriber details. What approach should I use on the SQL server?
Using the ROW_NUMBER() function seems a bit overkill, as it has to number all ~500,000 rows (I guess?), so what other possible solutions are there?
Thanks in advance!
ROW_NUMBER() is probably your best choice.
From this MSDN article: http://msdn.microsoft.com/en-us/library/ms186734.aspx
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS 'RowNumber'
FROM Sales.SalesOrderHeader
)
SELECT *
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
And just substitute 50 and 60 with parameters for the row numbers you want.
Tommy, if your user has time to page through 500,000 rows at one page per row, then he/she is unique.
I guess what I am saying here is that you may be able to provide a better UX. Too many pages? Build a search feature.
There are two potential workarounds (for this purpose, using a start of 201, pages of 100):
SQL
SELECT TOP 100 * FROM MyTable WHERE ID > 200 ORDER BY ID
LINQ to SQL
var MyRows = (from t in db.Table
              orderby t.ID ascending
              select t).Skip(200).Take(100);
If your ID field has a clustered index, use the former. If not, both of these will take the same amount of time (LINQ returns 500,000 rows, then skips, then takes).
If you're sorting by something that's NOT ID and you have it indexed, use ROW_NUMBER().
Edit: Because the OP isn't sorting by ID, the only solution is ROW_NUMBER(), which is the clause that I put at the end there.
In this case, the table isn't indexed, so please see here for ideas on how to index to improve query performance.
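For the single-row next/previous case, the "seek past the last key" idea from the first workaround can be sketched as below, using Python's sqlite3 (so LIMIT 1 stands in for TOP 1). It assumes the column being paged on is unique and indexed; the subscribers table, the sub_no column and both helper functions are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sub_no is the primary key, so lookups by it use an index.
conn.execute("CREATE TABLE subscribers (sub_no INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?)",
    [(n, f"subscriber {n}") for n in (10, 20, 35, 50)],
)


def next_row(current_sub_no):
    """Return the row right after the current one, or None at the end."""
    return conn.execute(
        "SELECT sub_no, name FROM subscribers "
        "WHERE sub_no > ? ORDER BY sub_no LIMIT 1",
        (current_sub_no,),
    ).fetchone()


def previous_row(current_sub_no):
    """Return the row right before the current one, or None at the start."""
    return conn.execute(
        "SELECT sub_no, name FROM subscribers "
        "WHERE sub_no < ? ORDER BY sub_no DESC LIMIT 1",
        (current_sub_no,),
    ).fetchone()


print(next_row(20))
print(previous_row(20))
print(next_row(50))
```

Each button click then becomes an index seek from the currently displayed key rather than a numbering of all rows, and gaps in the key values do not matter.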
