Use MapReduce or other distributed computation method for an analytics calculation?

Let's say I have three basic models: a User, a Company, and a Visit. Every time a User goes to a Company, a Visit is recorded in this format (user_id, company_id, visit_date).
I'd like to be able to calculate the average time between visits for a company. Not visits overall, but specifically how long on average one of their customers waits before returning to the store.
For example, if one user visited on Tuesday, Wednesday, and Friday that gives one "gap" of one day, and one "gap" of two days => (1, 2). If another user visited on Monday and Friday, that gives one gap of 4 days => (4). If a third user visited only once, he should not be considered. The average time between user visits for the company is (1 + 2 + 4) / 3 = 2.333 days.
If I have thousands of users, visits, and companies and I want to calculate a single figure for each company, how should I go about this? I've only done basic MapReduce applications before and I can't figure out what my Map and Reduce steps would be to get this done. Can anyone help me figure out a MapReduce in pseudocode? Or is there some other method of distributed calculation I can reasonably perform? For the record, I'd like to perform this operation on my database every night.

The overly simplistic approach would be to have two job steps.
The first job step has a mapper that writes key/value pairs with a key of the form "user:company" and a value of "visit_date". In the example above, the mapper would write something like:
"user1:companyA" -> "2012/07/16"
"user1:comapnyA" -> "2012/07/17"
"user1:comapnyA" -> "2012/07/19"
"user2:comapnyA" -> "2012/07/15"
"user2:comapnyA" -> "2012/07/19"
...
This means that each call to the reducer will pass all of the visits by a single user to a single company. That means that one call to the reducer will pass in:
"user1:companyA" -> {2012/07/16, 2012/07/17, 2012/07/19}
and another call will pass in:
"user2:companyA" -> {2012/07/15, 2012/07/19}
I'm assuming the set of dates (passed in as an Iterable value) is small enough to manage in memory: sort it, compute the gaps between consecutive dates, and write a record for each gap as a key/value pair with key "company" and value "gap". For example, when passed:
"user1:companyA" -> {2012/07/16, 2012/07/17, 2012/07/19}
The first job's reducer will write to the context:
"companyA" -> 1
"compnayA" -> 2
The second job has a pass-through mapper that just passes the company/gap info on to the reducer. Each call to the reducer gives an Iterable value of gaps for a specific company. Iterate through the data to produce an average and write the key value pair in the form "company" and "average_gap".
If the original set of visits gets too big, we can talk about getting Hadoop to do the sorting for you (a secondary sort) with some custom comparators.
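To make this concrete, here is a rough Python sketch of the two jobs described above, with plain functions standing in for the mapper/reducer interfaces and dictionaries simulating the shuffle/group-by-key step (names and dates are illustrative, not a real Hadoop job):

from collections import defaultdict
from datetime import date

# --- Job 1: turn raw visits into per-company gaps ---
def job1_map(user_id, company_id, visit_date):
    # the key groups all visits by one user to one company
    yield "%s:%s" % (user_id, company_id), visit_date

def job1_reduce(key, visit_dates):
    _user, company = key.split(":")
    ordered = sorted(visit_dates)
    for earlier, later in zip(ordered, ordered[1:]):
        yield company, (later - earlier).days       # e.g. "companyA" -> 1

# --- Job 2: average the gaps per company ---
def job2_map(company, gap):
    yield company, gap                              # pass-through

def job2_reduce(company, gaps):
    gaps = list(gaps)
    yield company, sum(gaps) / len(gaps)            # "companyA" -> 2.333...

def run(visits):
    # simulate the shuffle: group job 1 map output by key
    by_user_company = defaultdict(list)
    for visit in visits:
        for key, value in job1_map(*visit):
            by_user_company[key].append(value)
    # job 1 reduce + job 2 pass-through map, grouped by company for job 2 reduce
    by_company = defaultdict(list)
    for key, dates in by_user_company.items():
        for company, gap in job1_reduce(key, dates):
            for key2, value in job2_map(company, gap):
                by_company[key2].append(value)
    return dict(pair for company, gaps in by_company.items()
                for pair in job2_reduce(company, gaps))

visits = [
    ("user1", "companyA", date(2012, 7, 16)),
    ("user1", "companyA", date(2012, 7, 17)),
    ("user1", "companyA", date(2012, 7, 19)),
    ("user2", "companyA", date(2012, 7, 15)),
    ("user2", "companyA", date(2012, 7, 19)),
]
print(run(visits))   # {'companyA': 2.3333333333333335}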

Related

How to design a recommendation system in DynamoDb based on likes

Considering performance as main importance, how would be the best design approach to follow in order to build a recommendation system in DynamoDb?
The system would need to store a URL and the number of times that topic was 'liked', but my requirements include the need to search by day, week, month and year, e.g.:
Give me the top 10 of the week
Give me the top 10 of the month
I was thinking about including date and time information so that the query could filter on this field, but I'm not sure whether that is good in terms of performance.
If the only data structure you had was a hash map, how would you solve this problem?
What if, on top of that constraint, you could only update any key up to 1000 times per second, and read a key up to 3000 times per second?
How often do you expect your items to get liked? Presumably there will be some that will be hot and liked a lot, while others would almost never get any likes.
How real-time does your system need to be? Can the system be eventually consistent (meaning, would it be OK if you only reported likes as of several minutes ago)?
Let's give this a shot
Disclaimer: this is very much a didactic exercise -- in practice you may want to explore an analytics product, or technologies other than DynamoDB, to accomplish this task.
Part 1. Representing an Item And Updating Like Counts
First, let's talk about your aggregation/analytics goals: you mentioned that you want to query for "top 10 of the week" or "top 10 of the month" but you didn't specify if that is supposed to mean "calendar week"/"calendar month", or "last 7 days"/"last 30 days".
I'm going to take it literally and assume that "top 10 of the week" means the top 10 items from the week that started on the most recent Monday (or Sunday, if you roll that way). Same for month: "top 10 of the month" means "top 10 items since the beginning of this month".
In this case, you will probably want to store, for each item:
a count of total all-time likes
a count of likes since the beginning of current month
a count of likes since the beginning of current week
current month number - needed to determine if we need to reset
current week number - needed to determine if we need to reset
Each week, reset the weekly count; each month, reset the monthly count.
In DynamoDB, this might be represented like so:
{
id: "<item-id>",
likes_all: <numeric>, // total likes of all time
likes_wk: <numeric>, // total likes for the current week
likes_mo: <numeric>, // total likes for the current month
curr_wk: <numeric>, // number of the current week of year, eg. 27
curr_mo: <numeric>, // number of the current month of year, eg. 6
}
Now, you can update the number of likes with an UpdateItem operation, with an UpdateExpression, like so:
aws dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET likes_all = likes_all + :lc, likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc" \
--expression-attribute-values '{":lc": {"N":"1"}}' \
--return-values ALL_NEW
This gives you a simple atomic way to increment the counts and get back the updated values. Notice the :lc value can be any number (not just 1). This will come in handy below.
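From application code, the same increment might look roughly like this (just a sketch using boto3's low-level DynamoDB client; the table name is a placeholder and error handling is omitted):

# Sketch: the same atomic increment from application code, using boto3's
# low-level DynamoDB client. Table name is a placeholder; error handling omitted.
import boto3

dynamodb = boto3.client("dynamodb")

def add_likes(item_id, count=1):
    response = dynamodb.update_item(
        TableName="<your-table-name>",
        Key={"id": {"S": item_id}},
        UpdateExpression="SET likes_all = likes_all + :lc, "
                         "likes_wk = likes_wk + :lc, "
                         "likes_mo = likes_mo + :lc",
        ExpressionAttributeValues={":lc": {"N": str(count)}},
        ReturnValues="ALL_NEW",
    )
    return response["Attributes"]  # the updated item, like --return-values ALL_NEW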
But there's a catch. You also need to be able to reset the counts if the week or month rolled over, so to do that, you can break the update into two operations:
update the total count (and get the most recent values back)
conditionally update the week and month counts
So, our update sequence becomes:
Step 1. update total count and read back the updated item:
aws dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET likes_all = likes_all + :lc" \
--expression-attribute-values '{":lc": {"N":"1"}}' \
--return-values ALL_NEW
This updates the total count and gives us back the state of the item. Based on the values of the curr_wk and curr_mo, you will have to decide what the update looks like. You may be either incrementing, or setting an absolute value. Let's say we're in the case when the update is being performed after the week rolled over, but not the month. And let's say that the result of the update above looks like this:
{
id: "<item-id>",
likes_all: 1000, // total likes of all time
likes_wk: 70, // total likes for the current week
likes_mo: 150, // total likes for the current month
curr_wk: 26, // number of the week of last update
curr_mo: 6, // number of the month of year of last update
}
curr_wk is 26, but at the time of this update the actual current week is 27.
Then your update query would look like this:
aws dynamodb update-item \
--table-name <your-table-name> \
--key '{"id":{"S":"<item-id>"}}' \
--update-expression "SET curr_wk = :new_wk, likes_wk = :lc, likes_mo = likes_mo + :lc" \
--condition-expression "curr_wk = :wk AND curr_mo = :mo" \
--expression-attribute-values '{":lc": {"N":"1"}, ":new_wk": {"N":"27"}, ":wk": {"N":"26"}, ":mo": {"N":"6"}}' \
--return-values ALL_NEW
The ConditionExpression ensures that we don't reset the likes twice, if two conflicting updates happen at the same time. In that case, one of the updates would fail and you'd have to switch the update back to an increment.
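Putting the two steps together in application code might look roughly like this (again just a boto3 sketch; the week/month numbering helper, the table name, and the retry handling are illustrative assumptions):

# Sketch: unconditional all-time increment, then a conditional week/month
# update that resets the counters when the week/month has rolled over.
import datetime
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE = "<your-table-name>"

def current_week_and_month(today=None):
    today = today or datetime.date.today()
    return today.isocalendar()[1], today.month

def plain_increment(item_id, count):
    # Fallback: somebody else already reset the counters, so just add to them.
    dynamodb.update_item(
        TableName=TABLE,
        Key={"id": {"S": item_id}},
        UpdateExpression="SET likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc",
        ExpressionAttributeValues={":lc": {"N": str(count)}},
    )

def like(item_id, count=1):
    # Step 1: bump the all-time count and read the item back.
    item = dynamodb.update_item(
        TableName=TABLE,
        Key={"id": {"S": item_id}},
        UpdateExpression="SET likes_all = likes_all + :lc",
        ExpressionAttributeValues={":lc": {"N": str(count)}},
        ReturnValues="ALL_NEW",
    )["Attributes"]

    wk, mo = current_week_and_month()
    stored_wk, stored_mo = int(item["curr_wk"]["N"]), int(item["curr_mo"]["N"])

    # Step 2: increment or reset the weekly/monthly counters, guarded by a
    # condition on the week/month we just read so two writers can't both reset.
    wk_expr = "likes_wk = likes_wk + :lc" if stored_wk == wk else "curr_wk = :new_wk, likes_wk = :lc"
    mo_expr = "likes_mo = likes_mo + :lc" if stored_mo == mo else "curr_mo = :new_mo, likes_mo = :lc"
    values = {":lc": {"N": str(count)}, ":wk": {"N": str(stored_wk)}, ":mo": {"N": str(stored_mo)}}
    if stored_wk != wk:
        values[":new_wk"] = {"N": str(wk)}
    if stored_mo != mo:
        values[":new_mo"] = {"N": str(mo)}
    try:
        dynamodb.update_item(
            TableName=TABLE,
            Key={"id": {"S": item_id}},
            UpdateExpression="SET " + wk_expr + ", " + mo_expr,
            ConditionExpression="curr_wk = :wk AND curr_mo = :mo",
            ExpressionAttributeValues=values,
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        plain_increment(item_id, count)  # a conflicting update already did the reset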
Part 2 - Keeping Track of Statistics
To take care of your statistics you need to keep track of most likes per week and per month.
You can keep a sorted list of hottest items per week and per month. You also can store these lists in Dynamo.
For example, let's say you want to keep track of top 3. You might store something like:
{
id: "item-stats",
week_top: ["item3:4000", "item2:2000", "item9:700"],
month_top: ["item2:100000", "item4:50000", "item3:12000"],
curr_wk: 26,
curr_mo: 6,
sequence: <optimistic-lock-token>
}
Whenever you perform an update for items, you would also update the statistics.
The algorithm for updating statistics will be similar to updating an item, except you can't just use update expressions. Instead you have to implement your own read-modify-write sequence using GetItem, PutItem and ConditionExpression.
First, you read the current values of the item-stats special item, including the value of the current sequence (this is important for detecting clobbering).
Then, you figure out whether the item(s) whose counts you've just updated would make it into the Top-N weekly or monthly list. If so, you update the week_top and/or month_top attributes and prepare a conditional PutItem request.
The PutItem request must include a condition that verifies the sequence of item-stats is still the same as what you read earlier. If not, you need to read the item again, re-compute the top-N lists, and attempt the put again.
Also, similar to the way the counts get reset for items, when an update happens you need to check and see if the weekly or monthly top needs to be reset as part of the update.
When you make the PutItem request, make sure to generate a new sequence value.
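A sketch of that read-modify-write loop, assuming the item-stats item lives in the same table and a UUID is used as the sequence token (the Top-N merge and the retry policy are simplified assumptions; the weekly/monthly reset described above is omitted):

# Sketch of the optimistic-lock loop for the "item-stats" item, using the boto3
# resource API. TOP_N, the merge logic, the uuid sequence token and the retry
# policy are illustrative.
import uuid
import boto3
from boto3.dynamodb.conditions import Attr
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("<your-table-name>")
TOP_N = 3

def merge_top(top_list, item_id, likes):
    # Rebuild the "id:count" list with this item's new count and keep the top N.
    counts = dict(entry.rsplit(":", 1) for entry in top_list)
    counts[item_id] = str(likes)
    ranked = sorted(counts.items(), key=lambda kv: int(kv[1]), reverse=True)
    return ["%s:%s" % (i, c) for i, c in ranked[:TOP_N]]

def record_in_stats(item_id, likes_wk, likes_mo, max_retries=5):
    for _ in range(max_retries):
        stats = table.get_item(Key={"id": "item-stats"})["Item"]
        stats["week_top"] = merge_top(stats["week_top"], item_id, likes_wk)
        stats["month_top"] = merge_top(stats["month_top"], item_id, likes_mo)
        old_sequence, stats["sequence"] = stats["sequence"], str(uuid.uuid4())
        try:
            # Only write if nobody else has updated item-stats since we read it.
            table.put_item(Item=stats,
                           ConditionExpression=Attr("sequence").eq(old_sequence))
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise  # a failed condition just means: re-read and retry
    raise RuntimeError("item-stats is too contended; giving up")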
Part 3 - Putting It All Together
In Part 1 and Part 2 we figured out how to keep track of likes and statistics, but there are big problems with our approach: performance would be pretty bad at any kind of real-life scale, hot items would create problems for us, and updating the Top-N stats would be a significant bottleneck.
To improve performance and achieve some scalability we'd want to get away from updating each item and the item-stats for every single "like".
We can achieve a good balance of performance and scalability using a combination of queues + dynamodb + compute resource.
create a queue to store pending likes
let "likes API" would enqueue a message tagging a post with a like, instead of applying them as they come
implement a queue consumer (could be a Lambda, or some other periodically running process) to pull messages off the queue and aggregate likes per item, then update items and the item-stats
By batching updates, we can get control over concurrency (and cost) at the expense of latency/eventual consistency.
We may end up with a limited number of queue consumers, each processing items in batches. In each batch, multiple item likes would be aggregated and a single update per item would be applied. Similarly, a single item-stats update would be applied per batch processor.
Depending on the volume of incoming likes, you may need to spin up more processors.
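A sketch of such a batch consumer, assuming SQS as the queue; the queue URL is a placeholder and apply_likes()/update_stats() stand in for the Part 1 and Part 2 updates:

# Sketch of a batch consumer: drain likes from a queue, aggregate per item,
# then apply one DynamoDB update per item and one item-stats update per batch.
from collections import Counter
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "<your-queue-url>"

def apply_likes(item_id, count):
    ...  # the two-step per-item update from Part 1

def update_stats(pending):
    ...  # the optimistic-lock item-stats update from Part 2, once per batch

def drain_once(max_batches=10):
    pending = Counter()
    handles = []
    for _ in range(max_batches):
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        for msg in resp.get("Messages", []):
            pending[msg["Body"]] += 1          # message body = liked item id
            handles.append(msg["ReceiptHandle"])
    for item_id, count in pending.items():
        apply_likes(item_id, count)            # one write per item, however many likes
    if pending:
        update_stats(pending)                  # one item-stats write per batch
    for handle in handles:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)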

Google Analytics: How to properly filter ga:1dayUsers and ga:30dayUsers

Question: What is the right way to filter active users based on the presence of an event?
I'm trying to report on a count of users that have performed a particular action (purchased an item) on my site.
The aim is to have a Daily Unique Buyer (akin to DAU or 1dayUsers) and Monthly Unique Buyer (akin MAU or 30dayUser) metric.
For the Daily Unique Buyer metric I have tried two separate approaches and I am getting different results for both.
Approach 1) Use ga:Users metric and apply filter ga:eventCategory=="Purchase"
Approach 2) Create custom Segment, Ensure that Advanced Filter condition is for Users (not Sessions) and set the same filter ga:eventCategory=="Purchase"
The first approach seems to yield the desired result when compared to the second.
Unfortunately, the first approach does not extend to computing the same metric for Monthly Unique Buyers.
Most posts on StackOverflow suggest that creating a segment (approach 2) is the right way forward. This, however, yields more users than events, which can't be correct.
Even more perplexing: applying the segment in the Audience -> Active Users interface yields a different result from the programmatic Apps Script query below:
const optArgs = {
  'dimensions': 'ga:date',
  'sort': '-ga:date',
  'start-index': '1',
  'max-results': 250,
  'segment': 'gaid::xxxx',
};
Analytics.Data.Ga.get(
myViewId, startDate, endDate, 'ga:1dayUsers', optArgs
);
Update, for those who have struggled with this: I don't claim to understand why, but I was able to get the correct numbers by querying the desired metrics ga:1dayUsers and ga:30dayUsers one date at a time.
Running the report over a date range failed. I checked this with the list of actual active users (under User Explorer in the interface) and both 1 day and 30 day metrics are correct.
Would love for someone to explain why this is needed.

First get list of users from table-1, then compare with current user by specific field from table-2

What is the procedure when I have 2 tables? From table-1 I get all the users I want, and then, once I have the list of user ids, I want to compare each user with the current user on a specific field found in table-2. The first task is easy: I just have an onDataChange that populates the users list with their ids. But now that I have this list, how do I iterate over each user and compare it with the current user based on a specific field from table-2?
What I currently try is a for loop that iterates over each user in the list, with an onDataChange call to table-2 for each one, and then I populate the necessary dataset. But when the for loop ends, the dataset is no longer visible.
I hope what I'm trying to achieve in this post is understandable.
I'll try to demonstrate with tables:
Assuming I get user list from table-1 based on data1:
table-1
|
|_____data1
|____uid20
|____uid30
|____uid44
Now I have a list of 3 users: uid20, uid30, uid44. Then I need to compare this list of users with the current user (call it uid1) based on another field (timestamp), which is found in table-2. What I mean is: after I have the list of users, I want to filter it down to users whose timestamp is close to the current user's, up to a certain amount of time. So in my example, I want to keep only users that are within 2 minutes of the current user's timestamp.
table-2
|
|______uid1
| |____timestamp: <some_timestamp>
|
|______uid20
| |____timestamp: <some_timestamp>
|
|______uid30
| |____timestamp: <some_timestamp>
|
|______uid44
|____timestamp: <some_timestamp>
But every time, something ends up out of scope of the nested listener, and it also looks like this isn't the right procedure. Maybe I first need to save what's found in table-1 locally? Or can it be done purely with Firebase calls?
This is some code:
Getting the current user, is easy:
final FirebaseUser user = FirebaseAuth.getInstance().getCurrentUser();
final String userId = user.getUid();
So I always have it visible at any scope
First, as I understand it, you can use startAt and endAt to get a range of values within two minutes of difference. What I mean is that instead of reading every value from your table-2, you can fetch only the values that match your use case; in this case, values that are within 2 minutes of the current timestamp.
For example, in your table 2, I would query like this:
ref.orderByChild("timestamp").startAt(yourCurrentTimeStamp).endAt(yourCurrentTimeStamp + 120000);
where 120000 is 2 minutes in milliseconds,
and then, when looping through these elements, I would use getKey() to get the key of each value returned by this query. That way I get only the users within 2 minutes of difference, and can then compare them against the list from your first loop to see if they match.
To compare two user IDs you can use equals(), since they are Strings:
if (snapshotUserTable1.getKey().equals(snapshotUserTable2.getKey())) {
    // ...
}
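Stripped of the Firebase SDK, the overall filtering logic amounts to something like this plain-Python sketch (it assumes both tables have already been read into dictionaries; names and timestamps are illustrative):

# Plain-data sketch of the filtering logic (not the Firebase SDK): take the user
# ids found under table-1/data1, then keep only those whose table-2 timestamp is
# within the two-minute window after the current user's timestamp.
TWO_MINUTES_MS = 2 * 60 * 1000  # 120000

def users_near_current(table1_data1, table2, current_uid):
    current_ts = table2[current_uid]["timestamp"]
    lo, hi = current_ts, current_ts + TWO_MINUTES_MS
    return [uid for uid in table1_data1
            if uid in table2 and lo <= table2[uid]["timestamp"] <= hi]

table1_data1 = ["uid20", "uid30", "uid44"]
table2 = {
    "uid1":  {"timestamp": 1_600_000_000_000},
    "uid20": {"timestamp": 1_600_000_050_000},   # 50 s after uid1
    "uid30": {"timestamp": 1_600_000_300_000},   # 5 min after uid1
    "uid44": {"timestamp": 1_600_000_100_000},   # 100 s after uid1
}
print(users_near_current(table1_data1, table2, "uid1"))  # ['uid20', 'uid44']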

Firebase Cohorts in BigQuery

I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.
I manage to get the users for period_lag 0, similar to what I can see in Firebase; however, the rest of the numbers don't look right:
Results:
One of the period_lag values is missing (I only see 0, 1 and 3 -> no 2), and the user counts for each lag period don't look right either! I would expect to see something like this:
Firebase Cohort:
I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
JOIN `dataset.app_events_*` AS questions
ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
-- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
(...)
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label
So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?
I think there is a misunderstanding in the concept of how and which data to retrieve into the activities table. Let me state the differences between the case presented in the other StackOverflow question you linked, and the case you are trying to reproduce:
In the other question, answers.creation_date refers to a date value that is not fixed and can take different values for a single user. I mean, the same user can post two different answers on two different dates; that way, you will end up with two activities entries like: {[ID:user1, date:2018-01], [ID:user1, date:2018-02], [ID:user2, date:2018-01]}.
In your question, the use of answers.user_dim.first_open_timestamp_micros refers to a date value that is fixed in the past, because as stated in the documentation, that variable refers to The time (in microseconds) at which the user first opened the app. That value is unique, and therefore, for each user you will only have one activities entry, like:{[ID:user1, date:2018-01],[ID:user2, date:2018-02],[ID:user3, date:2018-01]}.
I think that is the reason why you are not getting information about the lagged retention of users, because you are not recording each time a user accesses the application, but only the first time they did.
Instead of using answers.user_dim.first_open_timestamp_micros, you should look for another value from the ones available in the documentation link I shared before, possibly event_dim.date or event_dim.timestamp_micros, although you will have to take into account that these fields refer to an event and not to a user, so you should do some pre-processing first. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
Finally, as a side note, it is pointless to JOIN a table with itself here, so your edited Standard SQL query would be better written as:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
GROUP BY id, period
)
(...)

Missing values in google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910

I am working with the practice repository in preparation for doing upcoming work with a large enterprise client using BQ. The repository link is: google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910
I have 3 questions to ask in relation to the sample repository and a query that was run (please see the query at the bottom of this post, which motivated the questions):
1) What is the difference between customDimensions.index, customDimensions.value and hits.customDimensions.index, hits.customDimensions.value?
2) If a single hit has multiple custom dimensions/metrics how is that returned/queried? I only see single dimensions matching at the hit level in the sample data.
3) There are no custom metric values passed in the example data, what will those values look like?
Here is the query that motivated the previous 3 questions:
SELECT hits.page.pagePath AS urls,
hits.time,
customDimensions.index,
customDimensions.value,
hits.customMetrics.index,
hits.customMetrics.value,
trafficSource.medium,
hits.customVariables.index,
hits.customVariables.customVarName,
hits.customVariables.customVarValue
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
Every record in that table represents one Google Analytics session. BigQuery has the concept of nested fields, and that's how individual hits are defined: they are nested inside the hits record.
Answering your questions:
1) customDimensions.index and customDimensions.value are the index and value for user- or session-scoped custom dimensions. hits.customDimensions.index and hits.customDimensions.value are the index and value for custom dimensions set at the hit scope. The scope is defined when you create the custom dimension through the GA interface. Indexes are integers from 1 to 20 (as defined in the Admin section) and value is the string passed as the value for that custom dimension. More info about Custom Dimensions/Metrics.
2) Both hits and hits.customDimensions are REPEATED RECORDS in BigQuery. So in essence every row in that BQ table looks like this:
|- date
|- (....)
+- hits
|- time
+- customDimensions
|- index
|- value
But when you query the data it is FLATTENed by default. Because it is flattened, if a single hit has multiple custom dimensions and metrics it will show up as multiple rows, one for each.
3) They should look the same as customDimensions, but the values are INTEGERs instead of STRINGs.
For a simpler and more educational dataset I suggest that you create a brand new BQ table and load the data provided on this developer document page.
PS: Tell my good friends at Cardinal Path that Eduardo said Hello!
