Getting ga:socialNetwork dimension in BigQuery Export - google-analytics

I'm trying to retrieve site referral data from social networks via BigQuery Export.
I've gotten the referral path from such sites, but I cannot find the neatly categorized field that is available in Google Analytics, i.e. ga:socialNetwork.
Does anyone know where to find this data?
So far, I've looked here: https://support.google.com/analytics/answer/3437719?hl=en
(and, in our data, of course)
Cheers!

Although the ga:socialNetwork dimension isn't currently available via BigQuery Export, as you mentioned, you can get the referral path using trafficSource.source.
You can see the difference between these two fields by running this query (against the Core Reporting API, which has both fields). You can then use the result as a lookup table for your data.

If anyone is interested, here is my solution, based on Andy's answer:
SELECT
  Week,
  IF(SocialNetwork IS NULL, Medium, "social") AS Medium,
  Referral_URL,
  SocialNetwork,
  Total_Sessions,
  Avg_Time_On_Site_in_Mins,
  Avg_Session_Page_Depth,
  Bounce_Rate
FROM (
  SELECT
    Week,
    Medium,
    Referral_HostName,
    Referral_URL,
    SocialNetworks.socialNetwork AS SocialNetwork,
    Total_Sessions,
    Avg_Time_On_Site_in_Mins,
    Avg_Session_Page_Depth,
    Bounce_Rate
  FROM
    [zzzzzzz.ga_sessions_20141223] AS All_Sessions
  LEFT JOIN EACH [GA_API.SocialNetworks] AS SocialNetworks
    ON All_Sessions.Referral_HostName = SocialNetworks.Source
  GROUP EACH BY Week, Medium, Full_URL, Referral_HostName, Referral_URL, SocialNetwork, Total_Sessions, Avg_Time_On_Site_in_Mins, Avg_Session_Page_Depth, Bounce_Rate
  ORDER BY Total_Sessions DESC )
GROUP EACH BY Week, Medium, Full_URL, Referral_URL, SocialNetwork, Total_Sessions, Avg_Time_On_Site_in_Mins, Avg_Session_Page_Depth, Bounce_Rate
ORDER BY Total_Sessions DESC;
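For anyone who wants to see the lookup idea outside of SQL, here is a minimal Python sketch of what the LEFT JOIN plus IF in the query above is doing. The lookup entries, the sample sessions and the name classify are made up for illustration; a real lookup table would come from the Core Reporting API, as the earlier answer suggests.

```python
# Hypothetical lookup table: referral hostname -> social network name.
# These entries are illustrative only, not an official GA mapping.
SOCIAL_NETWORKS = {
    "facebook.com": "Facebook",
    "t.co": "Twitter",
}


def classify(session):
    """Mimic the LEFT JOIN + IF: attach the social network (or None)
    and override the medium with "social" when a network is found."""
    network = SOCIAL_NETWORKS.get(session["referral_hostname"])
    medium = "social" if network is not None else session["medium"]
    return {**session, "social_network": network, "medium": medium}


# Hypothetical sessions standing in for rows of the export table.
sessions = [
    {"referral_hostname": "facebook.com", "medium": "referral"},
    {"referral_hostname": "example.org", "medium": "referral"},
    {"referral_hostname": "google", "medium": "organic"},
]

classified = [classify(s) for s in sessions]
for row in classified:
    print(row)
```

Sessions with no match keep their original medium and get no social network, which is the same behaviour as the unmatched side of the LEFT JOIN.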

This is now available in BigQuery Export with the field name hits.social.socialNetwork.
The detailed documentation is here: https://support.google.com/analytics/answer/3437719?hl=en
Following this documentation, I ran a sample query, which worked fine:
SELECT
  COUNT(totals.visits),
  hits.social.socialNetwork
FROM
  [project:dataset.ga_sessions_20161101]
GROUP BY
  hits.social.socialNetwork
ORDER BY
  1 DESC

Related

Firebase Cohorts in BigQuery

I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.
I managed to get the users for period_lag 0 similar to what I can see in Firebase; however, the rest of the numbers don't look right:
Results:
One period_lag is missing (I only see 0, 1 and 3, but no 2), and the user counts for each lag period don't look right either. I would expect to see something like this:
Firebase Cohort:
I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:
#standardSQL
WITH activities AS (
  SELECT answers.user_dim.app_info.app_instance_id AS id,
    FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
  FROM `dataset.app_events_*` AS answers
  JOIN `dataset.app_events_*` AS questions
    ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
  -- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
(...)
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label
So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?
I think there is a misunderstanding in the concept of how and which data to retrieve into the activities table. Let me state the differences between the case presented in the other StackOverflow question you linked, and the case you are trying to reproduce:
In the other question, answers.creation_date refers to a date value that is not fixed and can take different values for a single user. The same user can post two different answers on two different dates, so you end up with two activities entries for that user, like: {[ID:user1, date:2018-01],[ID:user1, date:2018-02],[ID:user2, date:2018-01]}.
In your question, answers.user_dim.first_open_timestamp_micros refers to a date value that is fixed in the past because, as stated in the documentation, that variable is "The time (in microseconds) at which the user first opened the app." That value is unique per user, and therefore you will only have one activities entry for each user, like: {[ID:user1, date:2018-01],[ID:user2, date:2018-02],[ID:user3, date:2018-01]}.
I think that is why you are not getting information about the lagged retention of users: you are not recording each time a user accesses the application, only the first time they did.
Instead of using answers.user_dim.first_open_timestamp_micros, you should look for another value among the ones available in the documentation link I shared before, possibly event_dim.date or event_dim.timestamp_micros. Bear in mind that these fields refer to an event and not to a user, so you will need to do some pre-processing first. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
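To illustrate the difference described above, here is a minimal Python sketch on made-up data. It assumes the events have already been flattened to one tuple per event, and it takes the first-open month as the cohort, which is a simplification rather than the exact logic of the linked cohort query. The names micros, month_index and retention are hypothetical helpers.

```python
from collections import defaultdict
from datetime import datetime, timezone


def micros(year, month, day):
    """Build a microsecond timestamp (UTC) for the sample data."""
    dt = datetime(year, month, day, tzinfo=timezone.utc)
    return int(dt.timestamp() * 1_000_000)


def month_index(ts_micros):
    """Convert a microsecond timestamp to a running month number."""
    dt = datetime.fromtimestamp(ts_micros / 1_000_000, tz=timezone.utc)
    return dt.year * 12 + dt.month


# Hypothetical flattened events:
# (app_instance_id, first_open_timestamp_micros, event timestamp_micros)
events = [
    ("user1", micros(2017, 11, 3), micros(2017, 11, 3)),
    ("user1", micros(2017, 11, 3), micros(2017, 12, 10)),
    ("user1", micros(2017, 11, 3), micros(2018, 1, 5)),
    ("user2", micros(2017, 11, 20), micros(2017, 11, 20)),
]


def retention(events, use_event_time):
    """Return {period_lag: number of distinct users active at that lag}."""
    users_by_lag = defaultdict(set)
    for user, first_open, event_ts in events:
        # The activity date is either the event time or, as in the
        # question, the first-open time (which never changes per user).
        activity_ts = event_ts if use_event_time else first_open
        lag = month_index(activity_ts) - month_index(first_open)
        users_by_lag[lag].add(user)
    return {lag: len(users) for lag, users in sorted(users_by_lag.items())}


print(retention(events, use_event_time=False))
print(retention(events, use_event_time=True))
```

With the first-open timestamp as the activity date every user can only ever land in lag 0, whereas the event timestamp spreads the same users over later lags.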
Finally, as a side note, there is no point in joining a table with itself here, so the beginning of your edited Standard SQL query would be better written as:
#standardSQL
WITH activities AS (
  SELECT answers.user_dim.app_info.app_instance_id AS id,
    FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
  FROM `dataset.app_events_*` AS answers
  GROUP BY id, period

Small difference in page views within BigQuery and GA

I'm querying page views by page from BigQuery. My query is:
SELECT hits.page.pagePath, COUNT(*) AS pageViews
FROM `bigquery-refresh.refresh.ga_sessions_2015*`,
  UNNEST(hits) AS hits
WHERE date >= '20150101' AND date < '20150701'
  AND geoNetwork.country = "United States"
  AND hits.type = "PAGE"
GROUP BY hits.page.pagePath
ORDER BY pageViews DESC
I'm comparing this query to the total page views reported from within GA (for the same country and date range), and am finding that the total number of page views in GA is ~0.4% larger than in BigQuery. Is there a reason for this small discrepancy?
I'm not familiar with GA, but here are my guesses:
(1) As Elliott pointed out, maybe GA includes some extra data.
(2) Maybe GA uses a different rule than COUNT(*).
(3) I happen to know that AdWords adjusts its report data even several days later. Maybe GA does the same.
Are you sure you're counting the right thing?
The schema documentation says that each row in BQ corresponds to a session (not a hit, nor a pageview), so COUNT(*) may not be counting what you expect and may therefore differ from the number in GA's UI.
The schema also shows that, for pageviews, you have these session-level totals (see the schema documentation for their definitions):
totals.pageviews
totals.hits
Every interaction with the page is a hit. Can you confirm whether using totals.pageviews gets you to the correct number?
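As a rough illustration of the comparison suggested above, here is a minimal Python sketch on made-up sessions. It assumes a structure loosely shaped like the export (one record per session, a session-level pageview total, and a repeated list of hits); the field names totals_pageviews and hits are simplified stand-ins.

```python
# Hypothetical sessions: one record per session, with a session-level
# total and a repeated list of hits (simplified stand-in field names).
sessions = [
    {"country": "United States", "totals_pageviews": 2,
     "hits": [{"type": "PAGE", "pagePath": "/home"},
              {"type": "EVENT", "pagePath": "/home"},
              {"type": "PAGE", "pagePath": "/pricing"}]},
    {"country": "United States", "totals_pageviews": 1,
     "hits": [{"type": "PAGE", "pagePath": "/home"}]},
    {"country": "Canada", "totals_pageviews": 1,
     "hits": [{"type": "PAGE", "pagePath": "/home"}]},
]

us = [s for s in sessions if s["country"] == "United States"]

# Approach 1: flatten the hits and count the PAGE hits
# (what COUNT(*) over UNNEST(hits) with hits.type = "PAGE" does).
page_hits = sum(1 for s in us for h in s["hits"] if h["type"] == "PAGE")

# Approach 2: add up the session-level totals, one value per session.
session_totals = sum(s["totals_pageviews"] for s in us)

print(page_hits, session_totals)
```

On consistent data the two numbers agree, so running both against the real tables is a quick way to see whether the gap comes from the counting method or from somewhere else.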

Counting session from each page from the data exported to bigquery from google analytics

I have been trying to count the session for each page using bigquery where data is exported to bigquery from GA. The schema of the data can be found here.
I have tried the following query:
SELECT
  hits.page.pagePath AS page,
  COUNT(totals.visits) AS sessions
FROM
  [xxxxxxx.ga_sessions_20160801]
WHERE
  REGEXP_MATCH(hits.page.pagePath, r'(orderComplete|checkout)')
  AND hits.type = 'PAGE'
GROUP BY
  page
ORDER BY
  sessions DESC
I compared the result of the query with the numbers that I get from GA, but the results are quite different. I expected the above query to give total sessions for each page, but it gives total pageviews for each page. In other words, the result exactly matches the pageviews of each page instead of the sessions of each page.
I also tried the following query:
SELECT
  hits.page.pagePath AS page,
  COUNT(hits.isEntrance) AS sessions
FROM
  [xxxxxxx.ga_sessions_20160801]
WHERE
  REGEXP_MATCH(hits.page.pagePath, r'(orderComplete|checkout)')
  AND hits.type = 'PAGE'
GROUP BY
  page
ORDER BY
  sessions DESC
This time the result is very close to the actual numbers, but not exactly the same as what I get from GA. The BigQuery result is slightly lower than GA's for some pages.
There is no sampling in GA in my case; otherwise the result would be acceptable, since the error is between 0.5% and 4%.
I am working with raw data without any filter on the GA profile, and the same data is exported to BigQuery.
Question: How are sessions counted when we count sessions by page?
When I don't group the result by hits.page.pagePath, there is no mismatch between the results from GA and those from BigQuery.
Instead of using COUNT(totals.visits), what if you use COUNT(1)? The results of COUNT will vary depending on whether you are using a repeated field. Possibly relevant question with some in depth answers: BigQuery flattens when using field with same name as repeated field
As an aside, standard SQL (uncheck "Use Legacy SQL" under "Show Options") has less surprising semantics around counting, although it would require you to be more explicit with operations on arrays in this case.
To count sessions, I use COUNT(visitId) instead of COUNT(totals.visits). This seems to give me numbers identical--or very, very close--to what I see in GA.
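To make the candidate definitions in this thread concrete, here is a minimal Python sketch that computes three different per-page numbers on made-up sessions: pageviews, distinct sessions that viewed the page, and entrances. Which of these GA reports as "Sessions" for a page is not confirmed anywhere in this thread; the fact that COUNT(hits.isEntrance) came close in the question only hints at an entrance-based count.

```python
from collections import Counter

# Hypothetical sessions: (visit id, PAGE hits as (pagePath, isEntrance)).
sessions = [
    ("s1", [("/checkout", True), ("/orderComplete", False), ("/checkout", False)]),
    ("s2", [("/home", True), ("/checkout", False)]),
]

pageviews = Counter()           # one per PAGE hit
sessions_with_page = Counter()  # one per session that saw the page
entrances = Counter()           # one per session that started on the page

for visit_id, hits in sessions:
    for path, is_entrance in hits:
        pageviews[path] += 1
        if is_entrance:
            entrances[path] += 1
    # A set removes repeat views of the same page within one session.
    for path in {p for p, _ in hits}:
        sessions_with_page[path] += 1

print(dict(pageviews))
print(dict(sessions_with_page))
print(dict(entrances))
```

The three counters diverge as soon as a session views the same page twice or enters on a different page, which is the kind of gap described in the question.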

GA bigquery page views by page title

I am a GA premium user and new to bigquery. I am doing some data exploration and want to pull page views by title. I don't think you can do something like the following query because the totals.pageview records are an aggregate already:
SELECT hits.page.pageTitle, SUM(totals.pageviews)
FROM [sample.ga_sessions_20150125]
GROUP BY hits.page.pageTitle
Can someone explain how I can recreate pulling pageviews from raw hits data? Thanks!
Be aware that totals contains aggregate values across the visit of the visitor, so the context for those numbers is the visitor.
SELECT hits.page.pageTitle,
  COUNT(DISTINCT fullVisitorId)
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY hits.page.pageTitle
This query calculates the unique visitors per page. You can run it on the BigQuery sample dataset.
Ahh, okay, I got it. I used:
SELECT hits.page.pageTitle, COUNT(*)
FROM [sample.ga_sessions_20150125]
WHERE hits.type = 'PAGE'
GROUP BY hits.page.pageTitle
Just needed the hits.type value, and to not use the totals number. Thanks!
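For future readers, here is a minimal Python sketch of why summing the totals number per title goes wrong while counting PAGE hits works. It uses one made-up session and assumes that the session-level total is repeated on every flattened hit row, which is how the original query would see it; the field names are simplified stand-ins.

```python
from collections import Counter

# One hypothetical session with three pageviews and one event.
sessions = [
    {"totals_pageviews": 3,
     "hits": [{"type": "PAGE", "title": "Home"},
              {"type": "PAGE", "title": "Pricing"},
              {"type": "EVENT", "title": "Pricing"},
              {"type": "PAGE", "title": "Home"}]},
]

summed_totals = Counter()  # session total added once per hit row
counted_hits = Counter()   # one per PAGE hit

for s in sessions:
    for h in s["hits"]:
        # Assumption: the session-level value rides along on every hit.
        summed_totals[h["title"]] += s["totals_pageviews"]
        if h["type"] == "PAGE":
            counted_hits[h["title"]] += 1

print(dict(summed_totals))
print(dict(counted_hits))
```

Only the per-hit count adds back up to the session's own pageview total; the summed version counts the whole session once for every hit.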

BigQuery for Google Analytics Export combining ANDs and ORs for visits with specific hits.page.pagePath doesn't work

I am trying to get figures related to search performance based on goal completions in Google Analytics. These goals are based on URLs, so as a first step I got the total completions by adding as many ORs as we have goal URLs, and that works fine.
The problem comes when we have to segment by "visits with search". This is also based on the URL (pagePath like "%search_parameter%"), but this time as a separate condition from the previous goal URLs:
SELECT SUM(totals.visits)
FROM [XXXXXX.ga_sessions_20150101]
WHERE
  (
    REGEXP_MATCH(hits.page.pagePath, r'/goal1/')
    OR REGEXP_MATCH(hits.page.pagePath, r'/goal2/')
    OR REGEXP_MATCH(hits.page.pagePath, r'/goal3/')
    OR REGEXP_MATCH(hits.page.pagePath, r'/goal4/')
  )
  AND REGEXP_MATCH(hits.page.pagePath, r'/search/')
In the Google Analytics interface I do, of course, have goals completed by people doing searches, so I don't understand what is missing from this query.
Any help?
Many thanks!
If I understood correctly that "visits with search" means that at least one URL in hits.page.pagePath matches '/search/', then I think the following should work for you:
SELECT SUM(totals.visits)
FROM
  (SELECT
    totals.visits,
    hits.page.pagePath,
    SOME(REGEXP_MATCH(hits.page.pagePath, r'/search/'))
      WITHIN RECORD AS has_search
  FROM [XXXXXX.ga_sessions_20150101]
  WHERE REGEXP_MATCH(hits.page.pagePath, r'/(goal1|goal2|goal3|goal4)/')
  )
WHERE has_search
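The original query most likely returns nothing because its WHERE clause is evaluated one hit at a time, so a single pagePath would have to match both a goal URL and '/search/'. The intended check is at session level. Here is a minimal Python sketch of that difference on made-up sessions; it illustrates the intended logic only and does not reproduce the legacy SQL semantics of WITHIN RECORD, and the session data and function names are hypothetical.

```python
import re

GOAL = re.compile(r"/(goal1|goal2|goal3|goal4)/")
SEARCH = re.compile(r"/search/")

# Hypothetical sessions: session id -> page paths of its hits.
sessions = {
    "s1": ["/search/?q=shoes", "/product/1", "/goal1/"],
    "s2": ["/home", "/goal2/"],
    "s3": ["/search/?q=hats", "/home"],
}


def per_hit_matches(paths):
    """Per-hit AND: the same path must match goal and search."""
    return sum(1 for p in paths if GOAL.search(p) and SEARCH.search(p))


def session_qualifies(paths):
    """Session-level check: some hit is a goal and some hit is a search."""
    return (any(GOAL.search(p) for p in paths)
            and any(SEARCH.search(p) for p in paths))


hit_level = sum(per_hit_matches(paths) for paths in sessions.values())
session_level = [sid for sid, paths in sessions.items()
                 if session_qualifies(paths)]

print(hit_level)
print(session_level)
```

The per-hit version finds no rows at all, while the session-level version picks out the one session that both searched and reached a goal.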
