Why is ordering not working on the /v2/shares endpoint - LinkedIn

At hootsuite.com we are using v2/shares to create reports for multiple social profiles over large periods of time.
The documentation for that endpoint specifies here that: "Shares are ordered by creation time".
At the moment, when I go to https://api.linkedin.com/v2/shares?q=owners&owners=urn:li:organization:15100279&sharesPerOwner=500&start=290
I see "activity": "urn:li:activity:6537431951580684288" with createdTime 1558645236755 between two other shares that both have larger createdTime values (1559217643294 and 1559131242301).
This means the results are not actually ordered by creation time.
If the ordering has been changed to lastModified time, please suggest a method to paginate through shares up to the last known share fetched 1 day ago, while guaranteeing that no new shares are missed.
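For context, this is a rough sketch (Python) of the kind of pagination we rely on today; the access token, the last known activity URN, and the count parameter are placeholders/assumptions rather than confirmed API behaviour, and the loop only works if shares are reliably ordered newest-first, which is exactly the guarantee in question:

import requests

API = "https://api.linkedin.com/v2/shares"
TOKEN = "<access-token>"                                            # placeholder
OWNER = "urn:li:organization:15100279"
LAST_KNOWN_URN = "<activity-urn-of-newest-share-fetched-yesterday>" # placeholder
PAGE_SIZE = 50

def fetch_new_shares():
    """Page through shares until the last known share shows up, collecting everything newer."""
    headers = {"Authorization": f"Bearer {TOKEN}"}
    start, collected, found_last_known = 0, [], False
    while not found_last_known:
        params = {
            "q": "owners",
            "owners": OWNER,
            "sharesPerOwner": 500,
            "start": start,
            "count": PAGE_SIZE,  # assumed standard Rest.li pagination parameter
        }
        elements = requests.get(API, headers=headers, params=params).json().get("elements", [])
        if not elements:
            break  # ran out of shares without finding the last known one
        for share in elements:
            if share.get("activity") == LAST_KNOWN_URN:
                found_last_known = True
                break
            collected.append(share)
        start += PAGE_SIZE
    return collected

If the ordering is in fact by lastModified, a loop like this can stop too early or too late, which is why we need a documented, reliable ordering or an alternative filter.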

Related

How to design a recommendation system in DynamoDb based on likes

Considering performance as the main priority, what would be the best design approach for building a recommendation system in DynamoDB?
The system would need to store a URL and the number of times that topic was 'liked', but my requirements include searches by day, week, month and year, e.g.:
Give me the top 10 of the week
Give me the top 10 of the month
I was thinking about including date and time information so that queries could filter on this field, but I'm not sure whether that is good in terms of performance.
If the only data structure you had was a hash map, how would you solve this problem?
What if on top of that constraint, you could only update any key up to 1000 times per second, and read a key up to 3000 per second?
How often do you expect your items to get liked? Presumably there will be some that will be hot and liked a lot, while others would almost never get any likes.
How real-time does your system need to be? Can the system be eventually consistent (meaning, would it be OK if you only reported likes as of several minutes ago)?
Let's give this a shot
Disclaimer: this is very much a didactic exercise -- in practice you may want to explore an analytics product, or technologies other than DynamoDB, to accomplish this task.
Part 1. Representing an Item And Updating Like Counts
First, let's talk about your aggregation/analytics goals: you mentioned that you want to query for "top 10 of the week" or "top 10 of the month" but you didn't specify if that is supposed to mean "calendar week"/"calendar month", or "last 7 days"/"last 30 days".
I'm going to take it literally and assume that "top 10 of the week" means top 10 items from this week that started on the most recent Monday (or Sunday if you roll that way). Same for month: "top 10 of the month" means "top 10 items since the beginning of this month".
In this case, you will probably want to store, for each item:
a count of total all-time likes
a count of likes since the beginning of current month
a count of likes since the beginning of current week
current month number - needed to determine if we need to reset
current week number - needed to determine if we need to reset
Each week, reset the count for the current week, and each month, reset the count for the current month.
In DynamoDB, this might be represented like so:
{
  id: "<item-id>",
  likes_all: <numeric>, // total likes of all time
  likes_wk: <numeric>,  // total likes for the current week
  likes_mo: <numeric>,  // total likes for the current month
  curr_wk: <numeric>,   // number of the current week of year, eg. 27
  curr_mo: <numeric>,   // number of the current month of year, eg. 6
}
Now, you can update the number of likes with an UpdateItem operation, with an UpdateExpression, like so:
aws dynamodb update-item \
  --table-name <your-table-name> \
  --key '{"id":{"S":"<item-id>"}}' \
  --update-expression "SET likes_all = likes_all + :lc, likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc" \
  --expression-attribute-values '{":lc": {"N":"1"}}' \
  --return-values ALL_NEW
This gives you a simple atomic way to increment the counts and get back the updated values. Notice the :lc value can be any number (not just 1). This will come in handy below.
But there's a catch. You also need to be able to reset the counts if the week or month rolled over, so to do that, you can break the update into two operations:
update the total count (and get the most recent values back)
conditionally update the week and month counts
So, our update sequence becomes:
Step 1. update total count and read back the updated item:
aws dynamodb update-item \
  --table-name <your-table-name> \
  --key '{"id":{"S":"<item-id>"}}' \
  --update-expression "SET likes_all = likes_all + :lc" \
  --expression-attribute-values '{":lc": {"N":"1"}}' \
  --return-values ALL_NEW
This updates the total count and gives us back the state of the item. Based on the values of the curr_wk and curr_mo, you will have to decide what the update looks like. You may be either incrementing, or setting an absolute value. Let's say we're in the case when the update is being performed after the week rolled over, but not the month. And let's say that the result of the update above looks like this:
{
  id: "<item-id>",
  likes_all: 1000, // total likes of all time
  likes_wk: 70,    // total likes for the current week
  likes_mo: 150,   // total likes for the current month
  curr_wk: 26,     // number of the week of last update
  curr_mo: 6,      // number of the month of year of last update
}
curr_wk is 26, but at the time of this update the actual current week is 27.
Then your update query would look like this:
aws dynamodb update-item \
  --table-name <your-table-name> \
  --key '{"id":{"S":"<item-id>"}}' \
  --update-expression "SET curr_wk = :new_wk, likes_wk = :lc, likes_mo = likes_mo + :lc" \
  --condition-expression "curr_wk = :wk AND curr_mo = :mo" \
  --expression-attribute-values '{":lc": {"N":"1"}, ":wk": {"N":"26"}, ":mo": {"N":"6"}, ":new_wk": {"N":"27"}}' \
  --return-values ALL_NEW
The ConditionExpression ensures that we don't reset the likes twice, if two conflicting updates happen at the same time. In that case, one of the updates would fail and you'd have to switch the update back to an increment.
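For completeness, here is a minimal Python (boto3) sketch of this two-step flow; the table name, the use of ISO week numbers, and the retry-as-plain-increment fallback are illustrative assumptions rather than part of the original design:

from datetime import date
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("likes")   # assumed table name

def apply_like(item_id, like_count=1):
    # Step 1: atomically bump the all-time count and read the item back.
    item = table.update_item(
        Key={"id": item_id},
        UpdateExpression="SET likes_all = likes_all + :lc",
        ExpressionAttributeValues={":lc": like_count},
        ReturnValues="ALL_NEW",
    )["Attributes"]

    today = date.today()
    wk, mo = today.isocalendar()[1], today.month    # assuming ISO week numbering

    if item["curr_wk"] == wk and item["curr_mo"] == mo:
        # No rollover: plain increments for the week/month counters.
        table.update_item(
            Key={"id": item_id},
            UpdateExpression="SET likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc",
            ExpressionAttributeValues={":lc": like_count},
        )
        return

    # Rollover: reset the stale counter(s), guarded by a condition on the old values.
    try:
        table.update_item(
            Key={"id": item_id},
            UpdateExpression="SET curr_wk = :new_wk, curr_mo = :new_mo, "
                             "likes_wk = :wk_val, likes_mo = :mo_val",
            ConditionExpression="curr_wk = :old_wk AND curr_mo = :old_mo",
            ExpressionAttributeValues={
                ":new_wk": wk, ":new_mo": mo,
                ":wk_val": like_count if item["curr_wk"] != wk else item["likes_wk"] + like_count,
                ":mo_val": like_count if item["curr_mo"] != mo else item["likes_mo"] + like_count,
                ":old_wk": item["curr_wk"], ":old_mo": item["curr_mo"],
            },
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # Someone else already reset the counters; fall back to a plain increment.
        table.update_item(
            Key={"id": item_id},
            UpdateExpression="SET likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc",
            ExpressionAttributeValues={":lc": like_count},
        )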
Part 2 - Keeping Track of Statistics
To take care of your statistics, you need to keep track of the most-liked items per week and per month.
You can keep a sorted list of the hottest items per week and per month. You can also store these lists in DynamoDB.
For example, let's say you want to keep track of top 3. You might store something like:
{
  id: "item-stats",
  week_top: ["item3:4000", "item2:2000", "item9:700"],
  month_top: ["item2:100000", "item4:50000", "item3:12000"],
  curr_wk: 26,
  curr_mo: 6,
  sequence: <optimistic-lock-token>
}
Whenever you perform an update for items, you would also update the statistics.
The algorithm for updating statistics will be similar to updating an item, except you can't just use update expressions. Instead you have to implement your own read-modify-write sequence using GetItem, PutItem and ConditionExpression.
First, you read the current values for the item-stats special item, including the value of the current sequence (this is important to detect clobbering)
Then, you figure out if the item(s) whose counts you've just updated would make it into the Top-N weekly or monthly list. If so, you would update the week_top and/or month_top attributes and prepare a conditional PutItem request.
The PutItem request must include a conditional check that verifies the sequence of the item-stats is the same as what you read earlier. If not, you need to read the item again and re-compute the top-N lists, then attempt to put again.
Also, similar to the way the counts get reset for items, when an update happens you need to check and see if the weekly or monthly top needs to be reset as part of the update.
When you make the PutItem request, make sure to generate a new sequence value.
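As a rough illustration, this read-modify-write loop might look like the following in Python with boto3 (the table name, the Top-N size, and the merge_top helper are assumptions for the sake of the example):

import uuid
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("likes")   # assumed table name
TOP_N = 3

def merge_top(top_list, item_id, count):
    """Insert or replace item_id in an '<item>:<count>' list and keep the N largest."""
    entries = {e.rsplit(":", 1)[0]: int(e.rsplit(":", 1)[1]) for e in top_list}
    entries[item_id] = count
    best = sorted(entries.items(), key=lambda kv: kv[1], reverse=True)[:TOP_N]
    return [f"{i}:{c}" for i, c in best]

def update_top_n(item_id, likes_wk, likes_mo):
    while True:
        # 1. Read the current stats, including the optimistic-lock token.
        stats = table.get_item(Key={"id": "item-stats"})["Item"]
        old_sequence = stats["sequence"]

        # 2. Recompute the Top-N lists locally.
        #    (Weekly/monthly rollover of the lists is omitted; see the reset logic above.)
        stats["week_top"] = merge_top(stats["week_top"], item_id, likes_wk)
        stats["month_top"] = merge_top(stats["month_top"], item_id, likes_mo)
        stats["sequence"] = str(uuid.uuid4())   # new token for this write

        # 3. Write back only if nobody else wrote in the meantime.
        try:
            table.put_item(
                Item=stats,
                ConditionExpression="#seq = :old",
                ExpressionAttributeNames={"#seq": "sequence"},
                ExpressionAttributeValues={":old": old_sequence},
            )
            return
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Lost the race: another writer updated item-stats first; read again and retry.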
Part 3 - Putting It All Together
In Part 1 and Part 2 we figured out how to keep track of likes and keep track of statistics, but there are big problems with our approach: performance would be pretty bad with any kind of real-life scale; hot items would create problems for us; updating the Top-N stats would be a significant bottleneck.
To improve performance and achieve some scalability we'd want to get away from updating each item and the item-stats for every single "like".
We can achieve a good balance of performance and scalability using a combination of queues + DynamoDB + a compute resource.
create a queue to store pending likes
let "likes API" would enqueue a message tagging a post with a like, instead of applying them as they come
implement a queue consumer (could be a Lambda, or some other periodically running process) to pull messages off the queue and aggregate likes per item, then update items and the item-stats
By batching updates, we can get control over concurrency (and cost) at the expense of latency/eventual consistency.
We may end up with a limited number of queue consumers, each processing items in batches. In each batch, multiple item likes would be aggregated and a single update per item would be applied. Similarly, a single item-stats update would be applied per batch processor.
Depending on the volume of incoming likes, you may need to spin up more processors.
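A sketch of such a batch consumer, again in Python with boto3 against SQS (the queue URL, the table name, and the choice to put the item id in the message body are assumptions):

from collections import Counter
import boto3

sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table("likes")   # assumed table name
QUEUE_URL = "<pending-likes-queue-url>"             # placeholder

def process_batch():
    # Pull up to 10 pending "like" messages (the SQS maximum per receive call).
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if not messages:
        return

    # Aggregate likes per item so each item gets a single update (message body = item id).
    likes_per_item = Counter(m["Body"] for m in messages)

    for item_id, count in likes_per_item.items():
        # One atomic increment per item, with the aggregated count as :lc.
        table.update_item(
            Key={"id": item_id},
            UpdateExpression="SET likes_all = likes_all + :lc, "
                             "likes_wk = likes_wk + :lc, likes_mo = likes_mo + :lc",
            ExpressionAttributeValues={":lc": count},
        )
        # (The week/month rollover and item-stats handling from Parts 1 and 2 go here.)

    # Remove the processed messages from the queue.
    sqs.delete_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[{"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                 for i, m in enumerate(messages)],
    )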

Google Analytics: How to properly filter ga:1dayUsers and ga:30dayUsers

Question: What is the right way to filter active users based on the presence of an event?
I'm trying to report on a count of users that have performed a particular action (purchased an item) on my site.
The aim is to have a Daily Unique Buyer (akin to DAU or 1dayUsers) and a Monthly Unique Buyer (akin to MAU or 30dayUsers) metric.
For the Daily Unique Buyer metric I have tried two separate approaches and I am getting different results for both.
Approach 1) Use ga:Users metric and apply filter ga:eventCategory=="Purchase"
Approach 2) Create custom Segment, Ensure that Advanced Filter condition is for Users (not Sessions) and set the same filter ga:eventCategory=="Purchase"
The first approach seems to yield the desired result when compared to the second.
Unfortunately, the first approach does not extend to computing the same metric for Monthly Unique Buyers.
Most posts on StackOverflow suggest that creating a segment (approach 2) is the right way forward. This, however, yields more users than events, which can't be correct.
Even more perplexing: applying the segment in the Audience -> Active Users interface yields a different result from the programmatic Apps Script query below
const optArgs = {
  'dimensions': 'ga:date',
  'sort': '-ga:date',
  'start-index': '1',
  'max-results': 250,
  'segment': 'gaid::xxxx',
};
Analytics.Data.Ga.get(
  myViewId, startDate, endDate, 'ga:1dayUsers', optArgs
);
Update: for those who struggled with this, I don't claim to understand why, but I was able to get the correct numbers by querying the desired metrics 1dayUsers and 30dayUsers one date at a time.
Running the report over a date range failed. I checked this with the list of actual active users (under User Explorer in the interface) and both 1 day and 30 day metrics are correct.
Would love for someone to explain why this is needed.
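For anyone who prefers to reproduce the workaround outside Apps Script, this is roughly what the one-day-at-a-time loop looks like with the Core Reporting API v3 Python client (the view id, segment id, and credentials are placeholders; ga:30dayUsers would be queried the same way):

from datetime import timedelta
from googleapiclient.discovery import build  # google-api-python-client

def daily_unique_buyers(creds, view_id, segment_id, start, end):
    """Query ga:1dayUsers one day at a time, since ranged queries gave wrong numbers."""
    analytics = build("analytics", "v3", credentials=creds)
    results = {}
    day = start
    while day <= end:
        d = day.strftime("%Y-%m-%d")
        resp = analytics.data().ga().get(
            ids=f"ga:{view_id}",
            start_date=d,
            end_date=d,              # same start and end date: one day per request
            metrics="ga:1dayUsers",
            segment=f"gaid::{segment_id}",
        ).execute()
        results[d] = int(resp.get("totalsForAllResults", {}).get("ga:1dayUsers", 0))
        day += timedelta(days=1)
    return results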

Firebase Cohorts in BigQuery

I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.
I managed to get the users for period_lag 0 similar to what I can see in Firebase; however, the rest of the numbers don't look right:
Results:
One of the period_lag values is missing (I only see 0, 1 and 3, no 2) and the user counts for each lag period don't look right either! I would expect to see something like this:
Firebase Cohort:
I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
JOIN `dataset.app_events_*` AS questions
ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
-- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
(...)
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label
So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?
I think there is a misunderstanding in the concept of how and which data to retrieve into the activities table. Let me state the differences between the case presented in the other StackOverflow question you linked, and the case you are trying to reproduce:
In the other question, answers.creation_date refers to a date value that is not fixed and can have different values for a single user: the same user can post two different answers on two different dates, so you end up with two activities entries for that user, like: {[ID:user1, date:2018-01],[ID:user1, date:2018-02],[ID:user2, date:2018-01]}.
In your question, the use of answers.user_dim.first_open_timestamp_micros refers to a date value that is fixed in the past, because as stated in the documentation, that variable refers to "The time (in microseconds) at which the user first opened the app". That value is unique, and therefore, for each user you will only have one activities entry, like: {[ID:user1, date:2018-01],[ID:user2, date:2018-02],[ID:user3, date:2018-01]}.
I think that is the reason why you are not getting information about the lagged retention of users, because you are not recording each time a user accesses the application, but only the first time they did.
Instead of using answers.user_dim.first_open_timestamp_micros, you should look for another value from the ones available in the documentation link I shared before, possibly event_dim.date or event_dim.timestamp_micros, although you will have to take into account that these fields refer to an event and not to a user, so you should do some pre-processing first. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
Finally, as a side note, it is pointless to JOIN a table with itself here, so your edited Standard SQL query would be better written as:
#standardSQL
WITH activities AS (
SELECT answers.user_dim.app_info.app_instance_id AS id,
FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
FROM `dataset.app_events_*` AS answers
GROUP BY id, period

Dynamodb data model for process/transaction monitoring

I want to keep track of a multi-stage processing job.
I likely just need the following fields:
batchId (guid) | eventId (guid) | statusId (int) | timestamp | message (string)
There is a relatively small number of events per batch.
I want to be able to easily query events that have a statusId less than n (still being processed or didn't finish processing).
Would using multiple rows for each status change, and querying for the latest status, be the best approach? I would use a global secondary index, but statusId does not seem like a good candidate for a hash key (there are fewer than 10 statuses).
Instead of using multiple rows for every status change, if you updated the same event row instead, you could use a technique described in the DynamoDB documentation in the section 'Use a Calculated Value'. Basically this would involve adding another attribute (say 'derivedStatusId') which would be derived by appending a random number to statusId at the time of writing to DynamoDB. For example, for a statusId of 2, derivedStatusId could be one of {"2-00", "2-01", .. "2-99"}. Setting up a Global Secondary Index on derivedStatusId would give you some fan-out that will help in preventing the index from becoming hot.
If you are sure that you will use this index only for unfinished events, then removing the derivedStatusId attribute from the record when it transitions to a finished status will remove it from the index as well, which may be a good property if events are expected to finish processing eventually, even if the records themselves stay around forever. This technique is called a "Sparse Index" and is described in more detail here.
From your question, it seems like keeping a status history is a desired property (I assume this because you want to have multiple rows for status changes). Consider putting this historical information in the same row. DynamoDB supports list data types and also has a generous 400KB item limit, which may just allow you to capture all the desired historical information in the same record.
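As a rough sketch of the "calculated value" plus sparse-index idea in Python with boto3 (the table name, GSI name, key schema, shard count, and the set of "finished" statuses are all assumptions for illustration):

import random
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("batch-events")     # assumed table, keyed on batchId + eventId
NUM_SHARDS = 10                            # fan-out factor for the GSI hash key
FINISHED_STATUSES = {5}                    # assumed "done" status ids

def record_status(batch_id, event_id, status_id, message):
    item = {
        "batchId": batch_id,
        "eventId": event_id,
        "statusId": status_id,
        "timestamp": int(time.time()),
        "message": message,
    }
    # Only unfinished events carry the sharded attribute, so the GSI stays sparse.
    if status_id not in FINISHED_STATUSES:
        item["derivedStatusId"] = f"{status_id}-{random.randint(0, NUM_SHARDS - 1):02d}"
    table.put_item(Item=item)  # overwrites the event row; finished events drop out of the index

def events_with_status(status_id):
    """Collect unfinished events for one status by querying every shard of the GSI."""
    events = []
    for shard in range(NUM_SHARDS):
        resp = table.query(
            IndexName="derivedStatusId-index",   # assumed GSI on derivedStatusId
            KeyConditionExpression=Key("derivedStatusId").eq(f"{status_id}-{shard:02d}"),
        )
        events.extend(resp["Items"])
    return events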

Amazon MWS API, listing a seller's products that are updated after a specific date

I am trying to list the products of a seller (using marketplaceID) that were created or updated after a specific date.
I tried RequestReport with ReportType "_GET_MERCHANT_LISTINGS_DATA_" and setting StartDate to the target date, but the data returned contains products that were created (or last updated) before that date.
https://developer.amazonservices.com/
The documentation is not very specific on what 'StartDate' actually does:
Start of a date range used for selecting the data to report.
Type: xs:datetime
Default: Now
If I recall correctly, this date does not relate to a product's modification timestamp but to a product's existence in the database. As an example, setting StartDate to yesterday should give you a list of products that were in the database within the last 24 hours. This includes products that were recently created and products that were created way before that but still exist.
I don't think it is possible to get a list of products that were modified within a given timeframe (again, I'm writing this from my recollection of how this worked when I played with it).
