Does anyone know how to extract checkout behaviour from the Google Analytics export in BigQuery?
E.g. I'd like to calculate abandonment at each checkout stage. I've raked through the schema -
https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089
but it doesn't seem to have the equivalent data from GA, i.e. the details within shopping stages such as "CHECKOUT_1_ABANDONMENT".
I can get each checkout step using hits_eCommerceAction_step but can't calculate exits from it: the result is always blank when I count hits.isExit.
hits.isExit refers to the last page in the session. It will not help you here, unless you want to know if any step was also the session exit.
Regarding e-commerce steps, you could define the highest step number per session as being the exit step or the last one seen - but I guess the highest makes more sense?!
Oh, and you have to translate what each step number means by yourself. It literally just tracks the number, not the meaning.
You can do it like that:
SELECT
(SELECT MAX(ecommerceaction.step) FROM t.hits) AS maxStep,
SUM(totals.visits) AS sessions
FROM `project.dataset.ga_sessions_2018*` t
GROUP BY 1
ORDER BY 1
If you want the "last step in a session"-logic, you would do it like that:
SELECT
(SELECT ecommerceaction.step FROM t.hits WHERE ecommerceaction.step is not null ORDER BY hitnumber DESC LIMIT 1) AS lastStep,
SUM(totals.visits) AS sessions
FROM `project.dataset.ga_sessions_2018*` t
GROUP BY 1
ORDER BY 1
I didn't check whether these match the numbers in Google Analytics, but I hope they help point you in the right direction.
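To get from "highest step per session" to abandonment per step, one more piece of arithmetic is needed: a session whose highest step is n also reached every step below n. Here is a minimal Python sketch of that arithmetic on made-up data. The function name `checkout_funnel` and the sample list are hypothetical, and it assumes steps are numbered 1, 2, 3, ... and that the per-session maximum has already been pulled from BigQuery.

```python
from collections import Counter

def checkout_funnel(max_step_per_session):
    """Return {step: (sessions_reached, sessions_stopped_here)}.

    Input is each session's highest checkout step (None = no checkout hits).
    """
    ended_at = Counter(s for s in max_step_per_session if s is not None)
    if not ended_at:
        return {}
    funnel = {}
    reached = 0
    # Walk from the last step backwards: a session that reached step n
    # also reached every earlier step.
    for step in range(max(ended_at), 0, -1):
        reached += ended_at[step]
        funnel[step] = (reached, ended_at[step])
    return dict(sorted(funnel.items()))

# Hypothetical sample: highest step seen in each of six sessions.
sample = [1, 1, 2, 3, 3, None]
funnel = checkout_funnel(sample)
for step, (reached, stopped) in funnel.items():
    print(f"step {step}: reached by {reached}, highest step for {stopped}")
```

For every step except the last, "stopped here" is the abandonment count for that step. For the last step it is the number of sessions that got all the way there, so it should not be read as abandonment.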
Related
I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.
I manage to get the users for period_lag 0 similar to what I can see in Firebase, however, the rest of the numbers don't look right:
Results:
There is one of the period_lag missing (only see 0,1 and 3 -> no 2) and the user counts for each lag period don't look right either! I would expect to see something like that:
Firebase Cohort:
I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
JOIN `dataset.app_events_*` AS questions
ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
-- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
(...)
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label
So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?
I think there is a misunderstanding in the concept of how and which data to retrieve into the activities table. Let me state the differences between the case presented in the other StackOverflow question you linked, and the case you are trying to reproduce:
In the other question, answers.creation_date refers to a date value that is not fixed and can take different values for a single user. That is, the same user can post two different answers on two different dates, so you end up with activities entries like: {[ID:user1, date:2018-01],[ID:user1, date:2018-02],[ID:user2, date:2018-01]}.
In your question, answers.user_dim.first_open_timestamp_micros refers to a date value that is fixed in the past because, as stated in the documentation, that variable is "The time (in microseconds) at which the user first opened the app." Each user has exactly one such value, so you will only have one activities entry per user, like: {[ID:user1, date:2018-01],[ID:user2, date:2018-02],[ID:user3, date:2018-01]}.
I think that is why you are not getting information about the lagged retention of users: you are not recording each time a user accesses the application, only the first time they did.
Instead of using answers.user_dim.first_open_timestamp_micros, you should look for another value from the ones available in the documentation link I shared before, possibly event_dim.date or event_dim.timestamp_micros, although you will have to take into account that these fields refer to an event and not to a user, so you should do some pre-processing first. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
Finally, as a side note, there is no point in JOINing the table with itself here, so your edited Standard SQL query would be better written as:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
GROUP BY id, period
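The difference between "one activity row per event period" and "one row per user" can be seen in a small Python model of the cohort calculation. This is only a sketch on invented data: the function name `retention`, the integer period indexes, and the sample events are assumptions for illustration, not the Firebase schema or the original query.

```python
from collections import defaultdict

def retention(events):
    """events: iterable of (user_id, period_index) pairs, one per activity.

    Returns {(cohort, period_lag): number of distinct users}.
    """
    periods_by_user = defaultdict(set)
    for user, period in events:
        periods_by_user[user].add(period)

    cells = defaultdict(int)
    for periods in periods_by_user.values():
        cohort = min(periods)  # first period in which the user was active
        for period in periods:
            cells[(cohort, period - cohort)] += 1
    return dict(cells)

# Hypothetical per-event activity: period 0 = first month, 1 = second month, ...
events = [("u1", 0), ("u1", 1), ("u1", 1), ("u2", 0), ("u2", 2), ("u3", 1)]
per_event = retention(events)

# Same users, but keeping only the first-open period for each of them.
first_only = retention([("u1", 0), ("u2", 0), ("u3", 1)])

print(per_event)   # lags 0, 1 and 2 are populated
print(first_only)  # only lag 0 is populated
```

With one fixed date per user, every user lands in lag 0 of their own cohort and no later lag can ever be filled, which is consistent with the explanation above.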
I am trying to get the count of totals.pageviews for people who go through the booking page on the website. Here is my query.
SELECT sum( totals.pageviews ) AS Searches,Date
FROM `table*`
WHERE exists (
select 1 from unnest(hits) as hits
where hits.page.pagePath ='booking'
)
and date='20161109'
GROUP BY DATE
But I got far more results than I got from Google Analytics.
Big query result: around 1M
GA: around 300,000
This is the GA page that I am trying to match with
GA result
After looking a bit more into Google Analytics data, I think that you actually want to count the entries in hits that match the condition directly instead of relying on totals.pageviews. The problem is that totals.pageviews is the total number of pageviews within a session, which includes pages that don't match your filter. I think you want something like this instead:
SELECT
COUNT(*) AS Searches,
Date
FROM `table*`, UNNEST(hits) AS hit
WHERE hit.page.pagePath = 'booking'
GROUP BY Date;
This counts the matched pages directly, and will hopefully give the expected numbers.
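To make the gap between the two approaches concrete, here is a small Python model on invented sessions. It is a sketch, not BigQuery code: each session is just a list of page paths, the session's pageview total is taken to be the length of that list, and both function names are hypothetical.

```python
# Hypothetical sessions: each inner list is the sequence of pages viewed.
sessions = [
    ["/home", "booking", "/cart", "booking", "/thanks"],
    ["/home", "/about"],
    ["booking"],
]

def sum_session_pageviews(sessions, path):
    """Mimics SUM(totals.pageviews) over sessions that contain the page."""
    return sum(len(pages) for pages in sessions if path in pages)

def count_matching_hits(sessions, path):
    """Mimics COUNT(*) over unnested hits filtered on the page path."""
    return sum(page == path for pages in sessions for page in pages)

session_total = sum_session_pageviews(sessions, "booking")
hit_total = count_matching_hits(sessions, "booking")
print(session_total, hit_total)
```

The first number adds up every pageview in any session that touched the booking page, so it grows with session length. The second counts only the booking pageviews themselves.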
Try below
SELECT
date,
COUNT(*) AS Searches,
SUM(totals.pageviews) as PageViews
FROM `table*`, UNNEST(hits) AS hit
WHERE hit.page.pagePath = 'booking'
AND hit.hitNumber = 1
GROUP BY date
Searches - number of sessions started with booking page as an entry point to website;
PageViews - number of pageviews in those (above) sessions
"I would like to have total(totals.pageview ) for the booking page on the website. how many times that the booking page has been viewed"
First, total(totals.pageview) doesn't help identify what you really need: it assumes that the totals.pageviews field is the right one to use, which it does not seem to be, at least based on the rest of your wording.
Secondly, assuming that what you need is the count of pageviews of the booking page on the website, the only reasonable answer is below:
SELECT
date,
COUNT(1) AS BookingPageViews
FROM `table*`, UNNEST(hits) AS hit
WHERE hit.page.pagePath = 'booking'
GROUP BY date
Finally, if you are still getting numbers different from what you expect, you need to revisit what you are actually looking for. It might be that the number you see in GA represents a metric different from the one you think it represents. That is the only explanation I can see.
I found a solution to this problem:
SELECT count(totals.pageviews) AS Searches,Date
FROM table, UNNEST(hits) as hits
WHERE hits.page.pagePath ='/booking' and hits.type='PAGE'
GROUP BY DATE
Hope this answer can help other people.
I'm querying page views by page from BigQuery. My query is:
SELECT hits.page.pagePath, COUNT(*) as pageViews FROM `bigquery-refresh.refresh.ga_sessions_2015*`,
UNNEST(hits) as hits
WHERE date >= '20150101' AND date < '20150701'
AND geoNetwork.country = "United States"
AND hits.type="PAGE"
GROUP BY hits.page.pagePath
ORDER BY pageViews DESC
I'm comparing this query to the total page views reported from within GA (for the same country and date range), and am finding that the total number of page views in GA is ~0.4% larger than in BigQuery. Is there a reason for this small discrepancy?
I'm not familiar with GA, but here are my random guesses:
(1) As Elliott pointed out, maybe GA includes some extra data
(2) Or maybe GA uses a different rule than count(*)
(3) I happen to know that Adwords will adjust the report data even several days later. Maybe GA has the same feature.
Are you sure you're counting the right thing?
On the Schema documentation it says that each row in BQ corresponds to a session (not a hit, nor a pageview), so the count(*) wouldn't be correct and thus show a different number when compared to GA's UI.
The schema also shows that for pageviews you have the totals:
totals.pageviews (check definition here)
totals.hits (check definition here)
So, every interaction with the page is a hit. Can you confirm that by using the totals.pageviews you get to the correct number?
I have been trying to count the session for each page using bigquery where data is exported to bigquery from GA. The schema of the data can be found here.
I have tried following query
SELECT
hits.page.pagePath AS page,
COUNT(totals.visits) AS sessions
FROM
[xxxxxxx.ga_sessions_20160801]
WHERE
REGEXP_MATCH(hits.page.pagePath, r'(orderComplete|checkout)')
AND hits.type = 'PAGE'
GROUP BY
page
ORDER BY
sessions DESC
I compared the result of the query with numbers that I get from the GA but the result is quite different. I expected that above query would give total session for each page but it gives total pageviews for each page. In other words result of above query exactly match with pageviews of each page instead of sessions of each page.
I also tried the following query
SELECT
hits.page.pagePath AS page,
COUNT(hits.isEntrance) AS sessions
FROM
[xxxxxxx.ga_sessions_20160801]
WHERE
REGEXP_MATCH(hits.page.pagePath, r'(orderComplete|checkout)')
AND hits.type = 'PAGE'
GROUP BY
page
ORDER BY
sessions DESC
The result this time is very close to actual but not exactly the same as numbers that I am getting from GA. This time bigquery result is slightly less than that of the GA for some pages.
There is no sampling in GA in my case. Otherwise the result would be acceptable, because the error is between 0.5% and 4%.
I am working with raw data without any filter on GA profile and same data is exported to bigquery.
Question: How is session counted when we count session by pages?
When I don't group the result by hits.page.pagePath there is no mismatch of results that I get from GA and that from bigquery
Instead of using COUNT(totals.visits), what if you use COUNT(1)? The results of COUNT will vary depending on whether you are using a repeated field. Possibly relevant question with some in depth answers: BigQuery flattens when using field with same name as repeated field
As an aside, standard SQL (uncheck "Use Legacy SQL" under "Show Options") has less surprising semantics around counting, although it would require you to be more explicit with operations on arrays in this case.
To count sessions, I use COUNT(visitId) instead of COUNT(totals.visits). This seems to give me numbers identical--or very, very close--to what I see in GA.
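The question's observation (a per-page count that matches pageviews rather than sessions) comes down to counting rows versus counting distinct sessions. Below is a minimal Python sketch of the two counts on invented data. It assumes "sessions for a page" means distinct sessions with at least one pageview of that page, which is only one plausible definition and is not confirmed here to be the one the GA interface uses. The function name is hypothetical.

```python
from collections import Counter

def pageviews_and_sessions(sessions):
    """Return (pageviews per page, distinct sessions per page)."""
    pageviews = Counter()
    sessions_with_page = Counter()
    for pages in sessions:
        pageviews.update(pages)                # every view counts
        sessions_with_page.update(set(pages))  # each page once per session
    return pageviews, sessions_with_page

# Hypothetical sessions: each inner list is the pages viewed in one session.
sessions = [
    ["/checkout", "/checkout", "/orderComplete"],
    ["/checkout"],
    ["/home"],
]
pageviews, sessions_with_page = pageviews_and_sessions(sessions)
print(pageviews["/checkout"], sessions_with_page["/checkout"])
```

Counting a field once per flattened hit row gives the first number. Getting the second requires deduplicating by session before counting.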
I am trying to get figures on search performance based on goal completions in Google Analytics. These goals are based on URLs, so as a first step I got the total completions by adding as many ORs as we have goal URLs, and that's fine. So far so good.
The problem comes when we have to segment this by "visits with search". That is also URL-based (pagePath like "%search_parameter%"), but this time as a separate condition from the previous goal URLs:
SELECT sum(totals.visits)
FROM [XXXXXX.ga_sessions_20150101]
WHERE
(
REGEXP_MATCH (hits.page.pagePath,r'/goal1/')
or REGEXP_MATCH (hits.page.pagePath,r'/goal2/')
or REGEXP_MATCH (hits.page.pagePath,r'/goal3/')
or REGEXP_MATCH (hits.page.pagePath,r'/goal4/')
)
and REGEXP_MATCH (hits.page.pagePath,r'/search/')
In Google Analytics interface of course I have goals completed from people doing searches, so I don't understand what may have been missing when constructing this query.
Any help?
Many Thanks!
If I correctly understood that "visits with search" mean that at least one url in hits.page.pagePath matches '/search/', then I think the following should work for you:
SELECT sum(totals.visits)
FROM
(SELECT
totals.visits,
hits.page.PagePath,
SOME(REGEXP_MATCH(hits.page.pagePath,r'/search/'))
WITHIN RECORD AS has_search
FROM [XXXXXX.ga_sessions_20150101]
WHERE REGEXP_MATCH(hits.page.pagePath,r'/goal1/|/goal2/|/goal3/|/goal4/')
)
WHERE has_search
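The reason the original query returns nothing is that both conditions were applied to the same hit, while "visits with search" is a property of the whole session. Here is a small Python model of the two readings. It is a sketch on invented sessions: the patterns follow the ones in the question, and the function names and sample paths are hypothetical.

```python
import re

GOAL = re.compile(r"/goal1/|/goal2/|/goal3/|/goal4/")
SEARCH = re.compile(r"/search/")

def hit_level_visits(sessions):
    """Counts sessions where ONE page path matches both patterns."""
    return sum(
        any(GOAL.search(p) and SEARCH.search(p) for p in pages)
        for pages in sessions
    )

def session_level_visits(sessions):
    """Counts sessions with some goal page AND some search page."""
    return sum(
        any(GOAL.search(p) for p in pages) and any(SEARCH.search(p) for p in pages)
        for pages in sessions
    )

# Hypothetical sessions: each inner list is the page paths of one visit.
sessions = [
    ["/home", "/search/?q=shoes", "/goal1/"],
    ["/goal2/"],
    ["/search/?q=hats", "/home"],
]
hit_level = hit_level_visits(sessions)
session_level = session_level_visits(sessions)
print(hit_level, session_level)

# Side note: wrapping alternatives in [...] makes a character class, which
# matches any single listed character rather than one of the alternatives.
bracketed_matches_home = bool(re.search(r"[/goal1/|/goal2/]", "/home"))
```

Only the first session both searched and completed a goal, and only the session-level check finds it. The last line shows why alternatives should be joined with `|` without surrounding square brackets.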