I am working with the practice dataset in preparation for upcoming work with a large enterprise client using BigQuery. The sample table is: google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910
I have 3 questions in relation to the sample dataset and a query that was run (the query that motivated these questions is at the bottom of this post):
1) What is the difference between customDimensions.index, customDimensions.value and hits.customDimensions.index, hits.customDimensions.value?
2) If a single hit has multiple custom dimensions/metrics how is that returned/queried? I only see single dimensions matching at the hit level in the sample data.
3) There are no custom metric values passed in the example data; what will those values look like?
Here is the query that motivated the previous 3 questions:
SELECT hits.page.pagePath AS urls,
hits.time,
customDimensions.index,
customDimensions.value,
hits.customMetrics.index,
hits.customMetrics.value,
trafficSource.medium,
hits.customVariables.index,
hits.customVariables.customVarName,
hits.customVariables.customVarValue
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
Every record in that table represents one Google Analytics session. BigQuery has the concept of nested fields, and that is how individual hits are defined: they are nested into the hits record.
Answering your questions:
1) customDimensions.index and customDimensions.value are the index and value for user- or session-scoped custom dimensions. hits.customDimensions.index and hits.customDimensions.value are custom dimensions set at the hit scope level. The scope is defined when you create the custom dimension through the GA interface. Indexes are integers from 1 to 20 (as defined in the Admin section) and value is the string passed as the value for that custom dimension. More info about Custom Dimensions/Metrics
2) Both hits and hits.customDimensions are REPEATED RECORDS in BigQuery. So in essence every row in that BQ table looks like this:
|- date
|- (....)
+- hits
|- time
+- customDimensions
|- index
|- value
But when you query the data it is FLATTENED by default. Because it is flattened, if a single hit has multiple custom dimensions and metrics it will show up as multiple rows, one for each (see the sketch after these answers).
3) They should look the same as customDimensions, but the values are INTEGERs instead of STRINGs.
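To see the flattening in action, here is a minimal legacy-SQL sketch against the same sample table (hits.hitNumber is a standard field in the GA export schema). A hit that carries two hit-level custom dimensions comes back as two rows, one per index/value pair:
SELECT
  hits.hitNumber,
  hits.customDimensions.index,
  hits.customDimensions.value
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.customDimensions.index IS NOT NULL
ORDER BY hits.hitNumber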
For a simpler and more educational dataset I suggest that you create a brand new BQ table and load the data provided on this developer document page.
PS: Tell my good friends at Cardinal Path that Eduardo said Hello!
I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.
I managed to get the users for period_lag 0, similar to what I can see in Firebase; however, the rest of the numbers don't look right:
Results:
One of the period_lag values is missing (I only see 0, 1 and 3 -> no 2), and the user counts for each lag period don't look right either! I would expect to see something like this:
Firebase Cohort:
I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
JOIN `dataset.app_events_*` AS questions
ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
-- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
(...)
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label
So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?
I think there is a misunderstanding about how, and which, data to retrieve into the activities table. Let me state the differences between the case presented in the other Stack Overflow question you linked and the case you are trying to reproduce:
In the other question, answers.creation_date refers to a date value that is not fixed and can have different values for a single user. I mean, the same user can post two different answers on two different dates; that way, you end up with multiple activities entries for the same user, like: {[ID:user1, date:2018-01], [ID:user1, date:2018-02], [ID:user2, date:2018-01]}.
In your question, the use of answers.user_dim.first_open_timestamp_micros refers to a date value that is fixed in the past because, as stated in the documentation, that variable refers to "The time (in microseconds) at which the user first opened the app." That value is unique per user, and therefore for each user you will only have one activities entry, like: {[ID:user1, date:2018-01], [ID:user2, date:2018-02], [ID:user3, date:2018-01]}.
I think that is the reason why you are not getting information about the lagged retention of users, because you are not recording each time a user accesses the application, but only the first time they did.
Instead of using answers.user_dim.first_open_timestamp_micros, you should look for another value from the ones available in the documentation link I shared before, possibly event_dim.date or event_dim.timestamp_micros, although you will have to take into account that these fields refer to an event and not to a user, so you should do some pre-processing first. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
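For illustration only, here is a minimal sketch of what the activities CTE could look like using event_dim.date instead. It assumes the old Firebase export schema, where event_dim is a repeated record and event_dim.date is a 'YYYYMMDD' string, and it keeps the `dataset.app_events_*` placeholder from your query:
#standardSQL
WITH activities AS (
  SELECT
    user_dim.app_info.app_instance_id AS id,
    -- event_dim.date is a 'YYYYMMDD' string in the old Firebase export schema
    FORMAT_DATE('%Y-%m', PARSE_DATE('%Y%m%d', event.date)) AS period
  FROM `dataset.app_events_*`,
    UNNEST(event_dim) AS event
  GROUP BY id, period
)
(...)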
Finally, as a side note, joining the table with itself is pointless here (with the tags filter commented out, the join no longer restricts anything), so your edited Standard SQL query would be better written as:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
GROUP BY id, period
)
(...)
Is there a way to use the smart list in a campaign to exclude records that match a filter?
I've got a custom table linked to the lead records, but it's a many-to-one style. I'm trying to suppress lead records where the history records match certain values.
The problem is that there seems to be no way to do it. I can have it include leads with no history records, or leads with history records without certain values, but if the same lead has multiple history records it will still show up as long as any one of those records has a value outside the exclusion.
What I want is the leads where NONE of their history records have those certain values, not simply to exclude the history records that match.
If this were a SQL join statement, what I'm getting is:
select * from leads
join history on history.leadid = leads.id and history.myval != 'x'
but what I want is:
select * from leads
where id not in (select id from history where myval = 'x')
You might be able to do it by creating a couple of smart lists:
a) A smart list that checks for all the people who have that custom object AND have the exact value(s) that you want to exclude.
b) Then create another smart list with criteria like:
All the people who have
1) that custom object
AND
2) are NOT in the first smart list (a).
That should give you the people you are looking for.
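In SQL terms (purely illustrative, reusing the leads/history tables from the question), that two-list approach works out to roughly:
SELECT *
FROM leads
-- has history records for the custom object at all
WHERE EXISTS (SELECT 1 FROM history WHERE history.leadid = leads.id)
-- and is NOT in smart list (a), i.e. has no history record with the excluded value
  AND NOT EXISTS (SELECT 1 FROM history
                  WHERE history.leadid = leads.id
                    AND history.myval = 'x')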
Hope this helps
Rajesh Talele
I have a list of unique customers who have made transactions over a year (Jan-Dec). They have bought products using 3 different methods (card, cash, check). My goal is to build a multi-classification model to predict the method of payment.
To do this I am engineering some recency and frequency features into my training data, but I am having trouble with the following frequency count, because the only way I know how to do it is in Excel using the COUNTIFS and SUMIFS functions, which are prohibitively slow. If someone can help and/or suggest another solution, it would be very much appreciated.
So I have a data set with 3 columns (Customer ID, Purchase Date, and Payment Type) that is sorted by Purchase Date and then Customer ID. How do I get a prior frequency count of payment type by date that does not include the current row's transaction or any future transactions (those with a later Purchase Date)? So basically I want a running count of each payment option, per Customer ID, over the date range before the purchase date of that training row. In my head I see it as "crawling" backwards through the transactions and counting. A simplified screenshot of the data frame is below, with the 3 prior-count columns I am looking to generate programmatically.
Screenshot
This gives you the answer as a list of CustomerID, PurchaseDate, PaymentMethod, and the prior count:
SELECT CustomerID, PurchaseDate, PaymentMethod,
  (
    SELECT COUNT(CustomerID) FROM History AS T
    WHERE
      T.CustomerID = History.CustomerID
      AND T.PaymentMethod = History.PaymentMethod
      AND T.PurchaseDate < History.PurchaseDate
  ) AS PriorCount
FROM History;
You can save this query and use it as the source for a crosstab query to get the columnar format you want
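For reference, a sketch of what that crosstab could look like in Access SQL, assuming the query above has been saved as PriorCounts (a name chosen here purely for illustration):
TRANSFORM Max(PriorCount)
SELECT CustomerID, PurchaseDate
FROM PriorCounts
GROUP BY CustomerID, PurchaseDate
PIVOT PaymentMethod;
The distinct PaymentMethod values become the column headings, with the prior count as the cell value.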
Some notes:
I assumed "History" as the source table name - you can change the query above to use the correct source
To use this as a query, open a new query in design view. Close the window that asks what tables the query is to be built on. Open the SQL view of the query design - like design view, but it shows the SQL instead of the normal design interface. Copy the above into the SQL view.
You should now be able to switch to datasheet view and see the results
When the query is working to your satisfaction, save it with any appropriate name
Open a new query in design view
When you get the list of tables to include, switch to the list of queries and include the query you just saved
Change the query type to crosstab and update the query as needed to select rows, columns and values - look up "access crosstab queries" if you need more help.
Another tip to see what is happening here:
You can take the subquery - the part inside the () above - and make just that statement into its own query, excluding the opening and closing (). Then you can look at its design view to see what it does.
Save it with an appropriate name and put it into the query above in place of the statement in () - then you can look at the design view.
Sometimes it's easier to visualize and learn from 2 queries strung together this way than to work with subqueries.
I want to keep track of a multi-stage processing job.
I likely just need the following fields:
batchId (guid) | eventId (guid) | statusId (int) | timestamp | message (string)
There is a relatively small number of events per batch.
I want to be able to easily query events that have a statusId less than n (still being processed or didn't finish processing).
Would using multiple rows for each status change, and querying for the latest status, be the best approach? I would use a global secondary index, but statusId does not seem like a good candidate for a hash key (fewer than 10 statuses).
Instead of using multiple rows for every status change, if you updated the same event row, you could use a technique described in the DynamoDB documentation in the section 'Use a Calculated Value'. Basically this would involve adding another attribute (say 'derivedStatusId'), derived by appending a random number to statusId at the time of writing to DynamoDB. For example, for a statusId of 2, derivedStatusId could be one of {"2-00", "2-01", .. "2-99"}. Setting up a global secondary index on derivedStatusId would give you some fan-out that will help prevent the index from becoming hot.
If you are sure that you will use this index only for unfinished events, then removing the derivedStatusId attribute from the record when it transitions to a finished status will remove it from the index as well - which may be a good property if events are expected to finish processing eventually, even if the items themselves stay around forever. This technique is called a "Sparse Index" and is described in more detail here.
From your question, it seems like keeping a status history is a desired property (I assume this because you want multiple rows for status changes). Consider putting this historical information in the same row: DynamoDB supports list data types and also has a generous 400KB item limit, which may just allow you to capture all the desired historical information in the same record.
In Pivotal CRM:
I have a set of applications (let's say job applications) for which I want to create a summary info screen ('Client Form').
I want to have a breakdown of applications by region; the regions are defined in another table.
How do I create a grid view of say:
Region | Number of Applications for that region | % of Total
And for bonus points: I just want applications for this month or year
What you need to do is create a 'read-only table' (view).
Create a SQL view (well, just the SELECT statement is enough) which will give you the data you need.
Use GROUP BY and INNER JOINs to get the data you need to calculate the percentage etc.
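As an illustration only (the table and column names below are made up; replace them with the ones in your Pivotal schema), the SELECT behind the view could look something like this on SQL Server:
SELECT
  r.Region_Name AS Region,
  COUNT(a.Application_Id) AS Applications,
  COUNT(a.Application_Id) * 100.0
    / SUM(COUNT(a.Application_Id)) OVER () AS Percent_Of_Total
FROM Region AS r
INNER JOIN Application AS a
        ON a.Region_Id = r.Region_Id
-- bonus points: only this month's applications
WHERE a.Application_Date >= DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
GROUP BY r.Region_Name;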
In Pivotal:
Create a read-only table.
Paste your SQL into the field on the Views tab.
Add matching fields, including the ones Pivotal normally creates when you make a table. (You need to make sure the data types in the view match what Pivotal would normally use.)
Bounce the ACC
Create lists, queries, search results lists, and client forms as normal.
See here for more details: Pivotal CRM Read-Only Views