Firebase Cohorts in BigQuery - firebase

I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.
I manage to get the users for period_lag 0 similar to what I can see in Firebase, however, the rest of the numbers don't look right:
Results:
There is one of the period_lag missing (only see 0,1 and 3 -> no 2) and the user counts for each lag period don't look right either! I would expect to see something like that:
Firebase Cohort:
I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
JOIN `dataset.app_events_*` AS questions
ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
-- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
(...)
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label
So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?

I think there is a misunderstanding in the concept of how and which data to retrieve into the activities table. Let me state the differences between the case presented in the other StackOverflow question you linked, and the case you are trying to reproduce:
In the other question, answers.creation_date refers to a date value that is not fix, and can have different values for a single user. I mean, the same user can post two different answers in two different dates, that way, you will end up with two activities entries like: {[ID:user1, date:2018-01],[ID:user1, date:2018-02],[ID:user2, date:2018-01]}.
In your question, the use of answers.user_dim.first_open_timestamp_micros refers to a date value that is fixed in the past, because as stated in the documentation, that variable refers to The time (in microseconds) at which the user first opened the app. That value is unique, and therefore, for each user you will only have one activities entry, like:{[ID:user1, date:2018-01],[ID:user2, date:2018-02],[ID:user3, date:2018-01]}.
I think that is the reason why you are not getting information about the lagged retention of users, because you are not recording each time a user accesses the application, but only the first time they did.
Instead of using answers.user_dim.first_open_timestamp_micros, you should look for another value from the ones available in the documentation link I shared before, possibly event_dim.date or event_dim.timestamp_micros, although you will have to take into account that these fields refer to an event and not to a user, so you should do some pre-processing first. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
Finally, as a side note, it is pointless to JOIN a table with itself, so regarding your edited Standard SQL query, it should better be:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
GROUP BY id, period

Related

Google Analytics: How to properly filter ga:1dayUsers and ga:30dayUsers

Question: What is the right way to filter active users based on the presence of an event?
I'm trying to report on a count of users that have performed a particular action (purchased an item) on my site.
The aim is to have a Daily Unique Buyer (akin to DAU or 1dayUsers) and Monthly Unique Buyer (akin MAU or 30dayUser) metric.
For the Daily Unique Buyer metric I have tried two separate approaches and I am getting different results for both.
Approach 1) Use ga:Users metric and apply filter ga:eventCategory=="Purchase"
Approach 2) Create custom Segment, Ensure that Advanced Filter condition is for Users (not Sessions) and set the same filter ga:eventCategory=="Purchase"
The first approach seems to yield the desired result when compared to the second.
Unfortunately, the first approach does not extend to computing the same metric for Monthly Unique Buyers.
Most post on StackOverflow suggest that creating a segment (approach 2) is the right way forward. This however, yields more users than events, which can't be correct.
Even more perplexing - Applying the segment in Audience -> Active Users interface yields a different result to programmatic app-script query below
const optArgs = {
'dimensions': 'ga:date',
'sort': '-ga:date','
start-index': '1',
'max-results': 250,
'segment: 'gaid::xxxx',
}
Analytics.Data.Ga.get(
myViewId, startDate, endDate, 'ga:1dayUsers', optArgs
);
update: For those that struggled with this. I don't claim to understand why, but I was able to get the correct number by querying the desired metrics 1dayUsers and 30dayUsers one date at a time.
Running the report over a date range failed. I checked this with the list of actual active users (under User Explorer in the interface) and both 1 day and 30 day metrics are correct.
Would love for someone to explain why this is needed.

How do I create a running count of outcomes sequentially by date and unique to a specific person/ID?

I have a list of unique customers who have made transactions over a year (Jan – Dec). They have bought products using 3 different methods (card, cash, check). My goal is to build a multi-classification model to predict the method pf payment.
To do this I am engineering some Recency and Frequency features into my training data, but am having trouble with the following frequency count because the only way I know how to do it is in Excel using the Countifs and SUMIFs functions, which are inhibitingly slow. If someone can help and/or suggest another solution, it would be very much appreciated:
So I have a data set with 3 columns (Customer ID, Purchase Date, and Payment Type) that is sorted by Purchase Date then Customer ID. How do I then get a prior frequency count of payment type by date that does not include the count of the current row transaction or any future transactions that are > the Purchase Date. So basically I want to do a running count of each payment option, based on a unique Customer ID, and a date range that is < purchase date of that training row. In my head I see it as “crawling” backwards through the transactions and counting. Simplified screenshot of data frame is below with the 3 prior count columns I am looking to generate programmatically.
Screenshot
This gives you the answer as a list of CustomerID, PurchaseDate, PaymentMethod and prior counts
SELECT CustomerID, PurchaseDate, PaymentMethod,
(
select count(CustomerID) from History T
where
T.CustomerID=History.CustomerID
and T.PaymentMethod=History.PaymentMethod
and T.PurchaseDate<History.PurchaseDate
)
AS PriorCount
FROM History;
You can save this query and use it as the source for a crosstab query to get the columnar format you want
Some notes:
I assumed "History" as the source table name - you can change the query above to use the correct source
To use this as a query, open a new query in design view. Close the window that asks what tables the query is to be built on. Open the SQL view of the query design - like design view, but it shows the SQL instead of the normal design interface. Copy the above into the SQL view.
You should now be able to switch to datasheet view and see the results
When the query is working to your satisfaction, save it with any appropriate name
Open a new query in design view
When you get the list of tables to include, switch to the list of queries and include the query you just saved
Change the query type to crosstab and update the query as needed to select rows, columns and values - look up "access crosstab queries" if you need more help.
Another tip to see what is happening here:
You can take the subquery - the parts inside the () above - and make
just that statement into it's own query, excluding the opening and closing (). Then you can look at it's design view to see what it does
Save it with an appropriate name and put it into the query above in place of the statement in () - then you can look at the design view.
Sometimes it's easier to visualize and learn from 2 queries strung together this way than to work with sub queries.

Counting session from each page from the data exported to bigquery from google analytics

I have been trying to count the session for each page using bigquery where data is exported to bigquery from GA. The schema of the data can be found here.
I have tried following query
SELECT
hits.page.pagePath AS page,
COUNT(totals.visits) AS sessions
FROM
[xxxxxxx.ga_sessions_20160801]
WHERE
REGEXP_MATCH(hits.page.pagePath, r'(orderComplete|checkout)')
AND hits.type = 'PAGE'
GROUP BY
page
ORDER BY
sessions DESC
I compared the result of the query with numbers that I get from the GA but the result is quite different. I expected that above query would give total session for each page but it gives total pageviews for each page. In other words result of above query exactly match with pageviews of each page instead of sessions of each page.
I also tried the following query
SELECT
hits.page.pagePath AS page,
COUNT(hits.isEntrance) AS sessions
FROM
[xxxxxxx.ga_sessions_20160801]
WHERE
REGEXP_MATCH(hits.page.pagePath, r'(orderComplete|checkout)')
AND hits.type = 'PAGE'
GROUP BY
page
ORDER BY
sessions DESC
The result this time is very close to actual but not exactly the same as numbers that I am getting from GA. This time bigquery result is slightly less than that of the GA for some pages.
There is no sampling in GA in my case otherwise result is acceptable because error is between 0.5% to 4%
I am working with raw data without any filter on GA profile and same data is exported to bigquery.
Question: How is session counted when we count session by pages?
When I don't group the result by hits.page.pagePath there is no mismatch of results that I get from GA and that from bigquery
Instead of using COUNT(totals.visits), what if you use COUNT(1)? The results of COUNT will vary depending on whether you are using a repeated field. Possibly relevant question with some in depth answers: BigQuery flattens when using field with same name as repeated field
As an aside, standard SQL (uncheck "Use Legacy SQL" under "Show Options") has less surprising semantics around counting, although it would require you to be more explicit with operations on arrays in this case.
To count sessions, I use COUNT(visitId) instead of COUNT(totals.visits). This seems to give me numbers identical--or very, very close--to what I see in GA.

Access 2010 Query with Parameter and Sort

I have a problem that I've been going round and round with in Access 2010. Imagine a table with these columns:
Name Date Time
Now, I have a query that asks the user to input a begin date and an end date and returns all records that are between those two dates. This works fine. However, as soon as I add a sort to the Date column things go awry. Once you put a sort on a column with a parameter the user gets asked to enter the parameter twice. From what I've been able to find out this is normal (although annoying) behavior in Access.
If I add the Date column in a second time and show the column with the sort and don't show the column with the parameter it works fine. The query would look something like:
Name Date (shown & sorted) Date (not shown & parameters) Time
Now when I run the query it all works well and comes out the way I want it to. This would obviously be a great solution then. However, there's another problem. When I save the query, leave, and reopen the query the two columns are merged back into each other. Thus, the change is lost and the user again sees two inputs.
My question is this: what can I do differently to achieve the desired results?
Some possible things I've thought about but don't know the answer to are:
Is there a way to make it so the columns don't merge? Do I have to use a form with the input boxes and take the data from that (I'd prefer not to do that as it will require a lot of additional work to handle the various things I am doing in the database). Is there some obvious thing I'm missing?
Thanks for any suggestions.
FYI: Here is the SQL from the query
SELECT Intentions.Intention, Intentions.MassDate, Intentions.[Time Requested], Intentions.[Place Requested], Intentions.[Offered By], Intentions.Completed
FROM Intentions
WHERE (((Intentions.MassDate) Between [Enter start date] And [Enter end date]))
ORDER BY Intentions.MassDate, Intentions.[Time Requested];
It is true that sometimes the Query Designer in Access will "reorganize" a query when you save it. However, I don't recall an instance where such a reorganization actually broke anything.
For what it's worth, the following query seems to do what you desire. After saving and re-opening it looks and behaves just the same:
For reference, the SQL behind it is
PARAMETERS startDate DateTime, endDate DateTime;
SELECT NameDateTime.Name, NameDateTime.Date, NameDateTime.Time
FROM NameDateTime
WHERE (((NameDateTime.Date) Between [startDate] And [endDate]))
ORDER BY NameDateTime.Date DESC , NameDateTime.Time DESC;
I have had the same problem and I have discovered the reason:
If, after you have run your query, sort a collumn in the result grid and the say yes to save changes to the query the sort action will be stored with the query. This will actually cause the query to run twice. First to create the result and then one more time to sort. You'll therefore be asked twice for the parameters.
SOLUTION: Run the query (entering your parameters twice ;-) ). Then remove the Sorting by clicking on the AZ-eraser symbol in the task bar above (in the sorting compartment).
Then open your query in design-mode and add the sorting order to the appropriate collumn.
Your are then good to go.
Regards
Jan

Missing values in google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910

I am working with the practice repository in preparation for doing upcoming work with a large enterprise client using BQ. The repository link is: google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910
I have 3 questions to ask in relation to the sample repository & a query that was run (please see the bottom of the link for the query that motivated the question:
1) What is the difference between customDimensions.index, customDimensions.value and hits.customDimensions.index, hits.customDimensions.value?
2) If a single hit has multiple custom dimensions/metrics how is that returned/queried? I only see single dimensions matching at the hit level in the sample data.
3) There are no custom metric values passed in the example data, what will those values look like?
Here is the query that motivated the previous 3 questions:
SELECT hits.page.pagePath AS urls,
hits.time,
customDimensions.index,
customDimensions.value,
hits.customMetrics.index,
hits.customMetrics.value,
trafficSource.medium,
hits.customVariables.index,
hits.customVariables.customVarName,
hits.customVariables.customVarValue
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
Every record in that table represents one Google Analytics Session. Big Query has this concept of nested fields and that's how individual hits are defined. They are nested into the hits record.
Answering your questions:
1) customDimensions.index and customDimensions.value are the index and value for user or session scoped custom dimensions. hits.customDimensions.index and hits.customDimensions.value re custom Dimensions set at hit scope level. The scope is defined when you create the custom Dimension through GA interface. indexes are integers from 1 to 20 (as defined in the Admin section) and value is the string passed as the value for that custom Dimension. More info about Custom Dimensions/Metrics
2) Both rows and rows.customDimensions are REPEATED RECORDS in Big Query. So in essence every row in that BQ table looks like this:
|- date
|- (....)
+- hits
|- time
+- customDimensions
|- index
|- value
But when you query the data this should be FLATTEN by default. Because it's flatten if a single hit has multiple custom dimensions and metrics it should show multiple rows, one for each.
3) Should be the same as customDimensions but the values are INTEGER instead of STRINGS.
For a simpler and more educational dataset I suggest that you create a brand new BQ table and load the data provided on this developer document page.
PS: Tell my good friends at Cardinal Path that Eduardo said Hello!

Resources