How to use DynamoDB Streams to maintain duplicated data consistency?

From what I understand, one of the use cases of DynamoDB Streams is to maintain/update duplicated data.
Let's say I have a User object, and its name attribute is replicated in many Invoice objects.
When a User edits/updates their name, I will have a Lambda, triggered by DynamoDB Streams, update all Invoices related to this user with the new name.
There could be thousands of Invoices related to this user, so this update could take a while, especially because I will want to do a rate-limited batch_write so that this operation doesn't throttle my table.
The question is: how can my (web) application know that the Lambda has finished updating? For example, I want to show a loading screen to the client until the duplicated data update is done, so that they don't see any outdated information in their browser.
Or are there other ways of rapidly updating thousands of duplicated records?

Why aren't you capturing the output of the Lambda? You can have the Lambda return a success status once all the updates have been persisted to DynamoDB.
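As a rough illustration of that idea (not from the original answer): the stream-triggered Lambda can fan the new name out to the Invoices and then write a completion marker that the web application polls. The Invoices and SyncStatus table names, the user_id GSI, and the attribute names below are all assumptions:

# Hypothetical stream-triggered fan-out with a completion marker the web app can poll.
# Assumes the stream view type includes NEW_AND_OLD_IMAGES.
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
invoices = dynamodb.Table("Invoices")       # hypothetical table holding the duplicated name
sync_status = dynamodb.Table("SyncStatus")  # hypothetical table the web app polls

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue
        new, old = record["dynamodb"]["NewImage"], record["dynamodb"]["OldImage"]
        if new["name"]["S"] == old["name"]["S"]:
            continue  # the name did not change, nothing to fan out
        user_id, new_name = new["user_id"]["S"], new["name"]["S"]

        # Fan the new name out to every Invoice of this user, rate limited.
        # (Pagination via LastEvaluatedKey omitted for brevity.)
        resp = invoices.query(IndexName="user_id-index",  # hypothetical GSI on user_id
                              KeyConditionExpression=Key("user_id").eq(user_id))
        for item in resp["Items"]:
            invoices.update_item(Key={"invoice_id": item["invoice_id"]},
                                 UpdateExpression="SET user_name = :n",
                                 ExpressionAttributeValues={":n": new_name})
            time.sleep(0.05)  # crude rate limiting; a token bucket would be nicer

        # Mark the fan-out as done so the client can stop showing its loading screen.
        sync_status.put_item(Item={"user_id": user_id, "state": "DONE",
                                   "updated_at": int(time.time())})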

Alternatively, an Invoice can keep a reference to the User object instead of storing the name itself, and fetch the name at the time of generating/printing.

Related

Bulk writes when one object requires the Key of another

Trying to figure out a good solution to this. Using Python and the NDB library.
I want to create an entity, and that entity is also tied to another entity. Both are created at the same time. Example would be creating a Message for a large number of users. We have an Inbox table/kind, and a Message table.
So once we gather the Keys of all the users we want, what I'm doing is just creating the Inbox entity, saving it, and then using the provided Key that it returns and attaching to the Message, and then saving the Message. For a large number of users, this seems pretty expensive. 2 writes per user. Normally I would just create the objects themselves and then use ndb.put_multi() to just batch the writes. Since there is no Key until it's saved, I can't do that.
Hope that made sense. Ideas?
Take a look at the allocate_ids API. You can pass a parent key and get ids assigned; the allocate_ids call guarantees that an id is never reused within that parent key context, and it is a small, fast operation. Once you allocate these ids, you can do the put_multi, referencing the allocated ids from the other entities that refer to them. As I understand it, the Message entity itself is not being referenced, and if so you only need to allocate ids for Inbox (presumably only if the user doesn't already have one) and do a put_multi on both the Inbox and Message entities.
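A minimal sketch of that flow with NDB (the Inbox and Message models and the send_to_users helper are made up just to show allocate_ids feeding a single put_multi; depending on the NDB version, allocate_ids returns either integer ids or Key bounds, so the sketch handles both):

from google.appengine.ext import ndb

class Inbox(ndb.Model):
    user = ndb.KeyProperty()

class Message(ndb.Model):
    inbox = ndb.KeyProperty(kind='Inbox')
    body = ndb.TextProperty()

def send_to_users(user_keys, body):
    # Reserve one Inbox id per user without writing anything yet.
    first, last = Inbox.allocate_ids(size=len(user_keys))
    # The returned bounds may be Key objects or plain integer ids.
    first_id = first.id() if hasattr(first, "id") else first
    last_id = last.id() if hasattr(last, "id") else last
    inbox_ids = range(first_id, last_id + 1)  # range is inclusive of both ends

    inboxes, messages = [], []
    for user_key, inbox_id in zip(user_keys, inbox_ids):
        inbox_key = ndb.Key(Inbox, inbox_id)
        inboxes.append(Inbox(key=inbox_key, user=user_key))
        messages.append(Message(inbox=inbox_key, body=body))

    # One batched RPC instead of two sequential puts per user.
    ndb.put_multi(inboxes + messages)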

EmberFire Relationship Persistance

Using EmberFire, I'm trying to work with related sets of data. In this case, a Campaign has many Players and Players have many Campaigns.
When I want to add a player to campaign, I understand that I can push the player object to campaign.players, save the player, and then save the campaign. This will update both records so that the relationship is cemented. This works fine.
My question is a hypothetical on how to handle failures when saving one or both records.
For instance, how would you handle a case where saving the player record succeeded (thus adding a corresponding campaign ID to its campaigns field), but saving the campaign then failed (thus failing to add the player to its players field)? It seems like in this case you'd open yourself up to some very messy data.
I was considering taking a "snapshot" of both records in question and then resetting them to their previous states if one update fails, but this seems like it's going to create some semi-nightmarish code.
Thoughts?
I guess you are using the Realtime Database. If you use the update() method with different paths, "you can perform simultaneous updates to multiple locations in the JSON tree with a single call to update()".
"Simultaneous updates made this way are atomic: either all updates succeed or all updates fail."
From the documentation: https://firebase.google.com/docs/database/web/read-and-write#update_specific_fields
So, in your case, you should do something like the following, though many variations are possible as long as the updates object holds all the simultaneous updates:
var updates = {};
// Write both sides of the relationship in one atomic multi-path update.
updates['/campaigns/' + campaignKey + '/players/' + newPlayerKey] = playerData;
updates['/players/' + newPlayerKey + '/campaigns/' + campaignKey] = true;
return firebase.database().ref().update(updates);

Look up the existence of a large number of keys (up to 1M) in datastore

We have a table with 100M rows in google cloud datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now, when we have a new user who says they like documents {a1, a2, ..., an}, we want to tell whether all these documents ak (k in 1 to n) are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all of this document content and use it to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1]. However, to check the existence of every key in a set of 1M, I still need to issue 1,000 requests.
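For illustration, a rough sketch of that batched lookup with the google-cloud-datastore Python client (the Document kind and its id scheme are assumptions):

from google.cloud import datastore

client = datastore.Client()
BATCH_SIZE = 1000  # per-request key limit for lookups

def missing_documents(doc_ids):
    """Return the subset of doc_ids whose Document entity does not exist yet."""
    missing_ids = []
    for i in range(0, len(doc_ids), BATCH_SIZE):
        chunk = doc_ids[i:i + BATCH_SIZE]
        keys = [client.key("Document", doc_id) for doc_id in chunk]
        found = {e.key.id_or_name for e in client.get_multi(keys)}
        missing_ids.extend(doc_id for doc_id in chunk if doc_id not in found)
    return missing_ids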
An alternative is a custom middle layer that provides a quick lookup (for example, using a Bloom filter or something similar) to save the round trips of multiple requests. Assuming we never delete keys, every time we insert a key we add it through the middle layer. The Bloom filter keeps track of what keys we have (with a tolerable false-positive rate). Since this is a custom layer, we could provide a microservice without such a limit; say, it could respond to a single request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
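A toy sketch of such a Bloom filter core (the bit-array size and hash count here are arbitrary; a real layer would size them from the expected key volume and acceptable false-positive rate, or just use an existing library):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means present, subject to the false-positive rate.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))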
Are there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking the problem down into a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue, see How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?. To avoid it, you can store the URL as a property of the webpage entity. You'll then have to query for the entity by the url property to determine if the webpage has already been considered for crawling. This is only eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query results. Not a big deal; it can be addressed by a bit of logic in the crawling job to prevent and/or remove document duplicates.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL, you just have to check if the matching document entity exists:
- if it does, just create the like mapping entity
- if it doesn't and you used the above-mentioned unique key identifiers:
  - create the document entity and launch its crawling job
  - create the like mapping entity
- otherwise:
  - launch the crawling job, which creates the document entity, taking care of deduplication
  - launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available; possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups; you only do one at a time, which is OK: a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.
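A minimal sketch of that per-document flow with the google-cloud-datastore Python client (the Document and Like kinds, the URL-hash key naming, and the enqueue_crawl stub are assumptions):

import hashlib
from google.cloud import datastore

client = datastore.Client()

def enqueue_crawl(url):
    # Placeholder for launching the actual crawling job (task queue, Pub/Sub, ...).
    pass

def doc_key(url):
    # Hash the URL to keep the key name short (see the key_name length caveat above).
    return client.key("Document", hashlib.sha256(url.encode()).hexdigest())

def record_like(user_id, url):
    key = doc_key(url)
    if client.get(key) is None:  # strongly consistent lookup by key
        doc = datastore.Entity(key=key)
        doc["url"] = url
        doc["crawled"] = False
        client.put(doc)
        enqueue_crawl(url)

    # Small mapping entity, kept separate from both the document and the user.
    like = datastore.Entity(client.key("Like", f"{user_id}:{key.name}"))
    like["user_id"] = user_id
    like["document"] = key
    client.put(like)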

Fetch new entities only

I thought Datastore keys were ordered by insertion date, but apparently I was wrong. I need to periodically look for new entities in the Datastore, fetch them, and process them.
Until now, I would simply store the last fetched key and (wrongly) query for anything greater than it.
Is there a way of doing so?
Thanks in advance.
Datastore's automatically generated keys are assigned with a uniform distribution, in order to make searches more performant. You will not be able to tell which entities were added last by using keys.
Instead, you can try a couple of different approaches.
Use Pub/Sub and architect your app so that another background task consumes the last added entities: on entity insertion you publish a new event to Pub/Sub with the key id, and your event listener (a separate routine) receives it.
Use key names and generate your own custom names. But since you would want sequentially growing names, this will cause a performance hit even on fairly small ranges of data. You can find more about this in the Google Datastore best practices:
https://cloud.google.com/datastore/docs/best-practices#keys
You can add an additional creation-time property and still use automatic key generation, then query on that property for anything newer than your last fetch, as sketched below.
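A small sketch of that last approach with the google-cloud-datastore Python client (the Item kind and the created property name are assumptions):

import datetime
from google.cloud import datastore

client = datastore.Client()

def save_item(payload):
    entity = datastore.Entity(client.key("Item"))  # automatic id generation
    entity.update(payload)
    entity["created"] = datetime.datetime.now(datetime.timezone.utc)
    client.put(entity)

def fetch_new_items(last_seen):
    # Everything created after the last processed timestamp (mind possible consistency lag).
    query = client.query(kind="Item")
    query.add_filter("created", ">", last_seen)
    query.order = ["created"]
    return list(query.fetch())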

Efficient DynamoDB schema for time series data

We are building a conversation system that will support messages between 2 users (and eventually between 3+ users). Each conversation will have a collection of users who can participate/view the conversation as well as a collection of messages. The UI will display the most recent 10 messages in a specific conversation with the ability to "page" (progressive scrolling?) the messages to view messages further back in time.
The plan is to store conversations and the participants in MSSQL and then only store the messages (which represents the data that has the potential to grow very large) in DynamoDB. The message table would use the conversation ID as the hash key and the message CreateDate as the range key. The conversation ID could be anything at this point (integer, GUID, etc) to ensure an even message distribution across the partitions.
In order to avoid hot partitions one suggestion is to create separate tables for time series data because typically only the most recent data will be accessed. Would this lead to issues when we need to pull back previous messages for a user as they scroll/page because we have to query across multiple tables to piece together a batch of messages?
Is there a different/better approach for storing time series data that may be infrequently accessed, but available quickly?
I guess we can assume that there are many "active" conversations in parallel, right? Meaning: we're not dealing with the case where all the traffic concerns a single conversation (or a few).
If that's the case, and you're using a random number/GUID as your HASH key, your objects will be evenly spread across the nodes and, as far as I know, you shouldn't be afraid of skew. Since CreateDate is only the RANGE key, all messages for the same conversation will be stored on the same node (based on their ConversationID), so it actually doesn't matter whether you query for the latest 5 records or the earliest 5; in both cases it's a query using the index on CreateDate.
I wouldn't break the data into multiple tables. I don't see what benefit it gives you (considering the previous section), and it will make your administrative life a nightmare (just imagine changing throughput for all tables, or backing them up, or creating a CloudFormation template to create your whole environment).
I would be concerned with the number of messages that will be returned when you pull the history. I guess you'll implement that with a query that uses ConversationID as the HASH key and orders results by CreateDate descending. In that case, I'd return only the first page of results (a query returns up to 1MB of data, so depending on the average message length that may or may not be enough) and fetch the next page only if the user keeps scrolling. Otherwise, you might use a lot of your throughput on really long conversations, and anyway the client doesn't really want to get stuck for a long time waiting for megabytes of data to appear on screen.
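A rough sketch of that paged query with boto3 (the Messages table name and the page size are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

messages = boto3.resource("dynamodb").Table("Messages")  # hypothetical table name

def latest_messages(conversation_id, page_size=10, start_key=None):
    """Fetch one page of the newest messages; pass the returned key back in to page further."""
    kwargs = {
        "KeyConditionExpression": Key("ConversationID").eq(conversation_id),
        "ScanIndexForward": False,  # newest first (descending CreateDate)
        "Limit": page_size,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = messages.query(**kwargs)
    return resp["Items"], resp.get("LastEvaluatedKey")  # None when there are no older pages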
Hope this helps
