Fetching parent and child items in a single query in DynamoDB - amazon-dynamodb

I have the following one-to-many relationship:
Account 1--* User
The Account contains global account-level information, which is mutable.
The User contains user-level information, which is also mutable.
When the user signs in, they need both Account and User information. (I only know the UserId at this point.)
I ideally want to design the schema such that a single query is necessary. However, I cannot determine how to do this without duplicating the Account into each User, which would require some background Lambda job to propagate Account attribute changes across all User objects. For the record, that seems like more resource usage (and code to maintain) than simply normalizing the data and running 2 queries on each sign-in: fetch the user, then fetch the account (using an FK inside the user object that identifies the account).
Is it possible to design a schema that allows one query to fetch both and doesn't require a non-transactional background job to propagate updates? (Transactional batch updates are out of the question, since there's >25 users.) And if not, is the 2-query idea the best / an acceptable method?

I'll focus on one angle in your question - the 2-query idea. In many cases it is indeed an acceptable method, better than the alternatives. In fact, in many NoSQL applications every user-visible request results in significantly more than two database requests; this is often cited as the reason why NoSQL systems care about low tail latencies (i.e., even 99th-percentile latencies should be low).
You didn't say why you wanted to avoid the 2-query solution. The 2-query implementation you presented has two downsides:
It is more costly: you need to do two queries instead of one, which (when the items read are shorter than 4 KB) costs double what a single read would.
Latency doubles if you need to finish the first query before you can issue the second one.
There may be tricks you can use to solve both problems, depending on more details of your use case:
For the latency: You didn't say what a "user id" is in your application. If it is some sort of unique numeric identifier, maybe it can be set up so that the account id can be determined from the user id directly, without a table lookup (e.g., the first bits of the user id are the account id). If this is the case, you can start both lookups at the same time and not double the latency. The cost will still be double, but not the latency (see the sketch after the next point).
For the cost: If there is a large number of users per account (you said there are more than 25 - I don't know if it's much more or not), it may be useful to cache the Account data so that not every user lookup needs to read the Account data again - it might often be cached. If Account information rarely changes and its consistency is not a big deal (I don't know if it is...), you can also get by with an "eventually consistent" read for the Account information - which costs half of a regular "strongly consistent" read.
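For illustration, here is a minimal boto3 sketch of that parallel-lookup trick, assuming a hypothetical encoding where the account id is a prefix of the user id and hypothetical Users/Accounts tables; batch_get_item fetches both items in a single round trip instead of two sequential GetItem calls:

import boto3

dynamodb = boto3.resource("dynamodb")

def fetch_user_and_account(user_id):
    # Hypothetical encoding: the account id is the prefix of the user id,
    # e.g. "ACCT123#USER456" -> account "ACCT123". Adjust to your own scheme.
    account_id = user_id.split("#")[0]

    # One round trip instead of two sequential GetItem calls.
    resp = dynamodb.batch_get_item(
        RequestItems={
            "Users": {"Keys": [{"UserId": user_id}]},
            "Accounts": {"Keys": [{"AccountId": account_id}]},
        }
    )
    users = resp["Responses"].get("Users", [])
    accounts = resp["Responses"].get("Accounts", [])
    return (users[0] if users else None, accounts[0] if accounts else None)

This removes the sequential dependency between the two reads; the cost of two reads remains.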

I think the following scheme will be useful for you.
You will store both account and user records in the same table.
You want to get both account metadata and linked users in a single query.
PK: account, SK: recordId
=== Account record ===
account: 123512321
recordId: METADATA
attributes: name, environment, ownerId...
=== User record ===
account: 123512321
recordId: USERID#34543543
attributes: name, email, phone...
With this denormalization of the data, you can retrieve both account metadata and related users in a single query. You can also change the account metadata without a need to apply any change to related users.
BONUS: you can also link other types of assets to the account record
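A rough sketch of that single query with boto3 (the table name AccountsAndUsers is a placeholder; the key attribute names follow the layout above):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AccountsAndUsers")  # placeholder name

def load_account_with_users(account_id):
    # One query returns the METADATA item plus every USERID#... item,
    # because they all share the same partition key.
    resp = table.query(KeyConditionExpression=Key("account").eq(account_id))
    items = resp["Items"]
    metadata = next((i for i in items if i["recordId"] == "METADATA"), None)
    users = [i for i in items if i["recordId"].startswith("USERID#")]
    return metadata, users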

Related

How to have realtime many-to-many two-way references in Firestore

I have a user collection in Firestore and each user object has an array of references to "tasks" that they have applied to. Tasks is a separate collection, and each task object has a user reference array as well.
Collection Tasks:
doc: {
  name: "Do something",
  time: "Time",
  users: ["/users/u1", "/users/u2"]
}
Collection Users:
doc: {
  name: "Username",
  tasks: ["/tasks/docRef", "/tasks/anotherDoc"]
}
I have a screen in my react-native app that lists all the tasks and when a task is clicked, it goes to a details screen that displays all the users in a list as well.
Is this the best approach for this kind of data? Or should I have collections instead of arrays with references? I refrained from collections to prevent duplication of data, but I'm not sure if they would be more efficient.
(From the comments) I wanted to ask whether this is the right approach to store the data.
1/ Should I just use uids and then query the collection for matching ids?
Storing uids or DocumentReferences in the arrays will not make a difference in terms of ease of querying the corresponding documents. It would only make a difference in the size of the document containing this data (since DocumentReferences are longer than uids).
2/ Or create a sub-collection in the user document itself with references to the tasks?
In the NoSQL world you should not hesitate to denormalize your data.
So having the tasks list in a user doc AND the users list in a task doc, and synchronizing the docs when a change is made in one of the collections, is a valid approach.
HOWEVER, you may encounter a problem if you have "a lot" of tasks for a given user or a lot of users involved in a task, since you may hit the maximum size for a document, which is 1 MiB (I agree that you need a LOT of tasks or users :-)).
To avoid that I would advise using sub-collections. This is also the preferred approach if you plan a high frequency of changes that could cause database contention or higher latency, see the documentation section about "Designing for scale".
If the user data you want to show in a task is limited (e.g. just their name plus a button to open each user profile based on the uid), I would keep an array of users in the task document with this limited amount of data (of course after making sure that there is no risk of a doc reaching the 1 MiB limit), and have a sub-collection of task documents under each user doc (as advised above).
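As an illustration of that layout - a small, display-ready users array on each task plus a sub-collection of tasks under each user - here is a minimal Python sketch with the google-cloud-firestore client (collection names and the participants shape are assumptions):

from google.cloud import firestore

db = firestore.Client()

def create_task(task_id, name, time, participants):
    # The task doc keeps only a small slice of each user (uid + name),
    # so the document stays far below the 1 MiB limit.
    db.collection("tasks").document(task_id).set({
        "name": name,
        "time": time,
        "users": [{"uid": u["uid"], "name": u["name"]} for u in participants],
    })
    # Each user gets the task in a sub-collection instead of an
    # ever-growing array on the user document itself.
    for u in participants:
        db.collection("users").document(u["uid"]) \
          .collection("tasks").document(task_id) \
          .set({"name": name, "time": time})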
Modeling well in Firestore is a bit difficult; you have to think hard about your use case.
Don't worry about data duplication - it is very common in Firestore. Remember that you mainly pay for the number of reads and writes.
In your user array, you could keep the necessary data to save you from doing more reads on the user collection.
In your example the user only has a username, so you could keep the username and uid saved in each task. That way, no reads would be needed on the user collection.
What if the user changes their username? Use a batched write and update all docs that contain that user.
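Assuming the embedded copies have the {uid, name} shape suggested above and that the user's tasks field holds DocumentReferences, a batched-write sketch for the rename might look like this:

from google.cloud import firestore

db = firestore.Client()

def rename_user(uid, new_name):
    user_ref = db.collection("users").document(uid)
    user = user_ref.get().to_dict()

    batch = db.batch()
    batch.update(user_ref, {"name": new_name})

    # Walk the user's task references and rewrite the embedded copy of the name.
    for task_ref in user.get("tasks", []):
        task = task_ref.get().to_dict()
        batch.update(task_ref, {
            "users": [
                {**u, "name": new_name} if u.get("uid") == uid else u
                for u in task.get("users", [])
            ]
        })

    batch.commit()  # a single batch allows up to 500 writes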

How to structure Firestore Security Rules & Data Structure for granular access

I am building a community-type app based on Firestore where users should have granular control over what kind of information they share with whom.
Users can have properties such as name, birthdate, etc. and for each of them they can decide to share it with the one of the following groups/roles:
Private
Contacts
Admin (Admins of organizations that user is a member of)
Organization (Members of organizations that a user is a member of)
Public (All users of the app)
As documents in Firestore will always be retrieved as a whole, I already know that I somehow will have to segregate my user properties by access level.
I've got two approaches so far:
Approach 1
Store each user property in a separate document that contains a field access level
Store some metadata in, for example /user/12345/meta/roles, so that I can point the security rules to those documents to validate access
Benefits:
Easy structure
Flexible
(Almost) no data duplication
Drawbacks:
Lots of document reads for getting a user's profile
Approach 2
Store user profile in, for example /user/12345/profile/private and duplicate the public information into /user/12345/profile/public, and do the same for each access level
Benefits:
Reduced document reads
Drawbacks:
Complexity
It feels wrong to duplicate that much data
Does anyone have any experience with this and any suggestions or alternative approaches they can share?
Follow-up question:
Let’s say I store the list of members of an organization in a subcollection, that is only accessible for members of the organization (for privacy reasons). Doesn’t that mean that when querying that list of members from client side, I have to do it „blindly“, meaning I can’t know if the user can access that document until I actually try? The fact that the query might fail would tell me that the user is not actually a member of that organization.
Would you consider this kind of query that is set up for failure bad practice? Are there any alternatives that still allow to keep the memberlist private?
I think you are moving from a SQL environment to NoSQL, which is why Approach 2 doesn't feel like the right way to proceed.
Actually, approach 2 is the right way to proceed. There are a couple of advantages:
1.) Reduced document reads - more cost savings. Firestore charges by the number of reads and writes, so if you are reducing the number of reads and writes it is always the way to go. Also, the cost of the extra storage caused by duplication will always be less than the actual cost of the reads as you scale up your application.
2.) In a NoSQL database you are allowed to duplicate data, provided it increases the read/search speed.
I don't see the second approach as complex, because that's the tradeoff you make when choosing NoSQL over SQL.
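To make approach 2 concrete, here is a small Python sketch (google-cloud-firestore) that fans a profile update out into one document per access level; the field-to-level mapping is purely illustrative, and your security rules would then grant read access per /profile/{level} path:

from google.cloud import firestore

db = firestore.Client()

# Illustrative mapping only: which fields are visible at which access level.
ACCESS_LEVELS = {
    "public": ["name"],
    "organization": ["name", "birthdate"],
    "private": ["name", "birthdate", "email"],
}

def save_profile(uid, profile):
    batch = db.batch()
    for level, fields in ACCESS_LEVELS.items():
        doc = {f: profile[f] for f in fields if f in profile}
        batch.set(db.document(f"user/{uid}/profile/{level}"), doc)
    batch.commit()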

Human-readable keys for documents or collections

I have searched throughout Stack Overflow looking for a way to generate numerical keys, or any type of keys that are readable for the end user.
I have found multiple answers saying (you shouldn't). I get it... but what's the alternative?
Imagine a customer having an issue regarding an Order, for instance, and having to spell out the uid 1UXBay2TTnZRnbZrCdXh to your call center.
It's usually a good idea to disassociate keys from the data they contain. The data can change, usernames, passwords, locations etc. That kind of data is very dynamic. However, links and references are more static in nature.
Suppose you have a list of followers and you're using their username as a key. If a user changes their username, not only will their entire node have to be deleted and re-written, every other occurrence of that key in the database would have to be changed as well. Whereas, if the key is static, the only item that changes is the child username.
So to answer the question: here's one option
orders
  firebase_generated_key_0
    order_number: "1111"
    ordered_by: "uid_0"
    order_amount: "$99.95"
  firebase_generated_key_1
    order_number: "2222"
    ordered_by: "uid_1"
    order_amount: "$12.95"
With this structure you have the order number, a link to the user that ordered it, and the total amount of the order. If the customer changes what's on the order, a simple change to the order_amount is made and the order stays in place.
Edit:
A comment/question asked about race conditions when writing data with Firebase. There are a number of solutions but a good starting point is with Firebase Transactions to essentially 'lock' data to prevent concurrent modifications.
See Save data as transactions for further reading.
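If you also want a short, human-readable order number next to the generated key, one common pattern is a counter node that is bumped inside a transaction. A sketch with the firebase-admin Python SDK (the credential file, database URL, and counter path are placeholders):

import firebase_admin
from firebase_admin import credentials, db

firebase_admin.initialize_app(
    credentials.Certificate("service-account.json"),  # placeholder credentials
    {"databaseURL": "https://your-project-default-rtdb.firebaseio.com"},
)

def next_order_number():
    def bump(current):
        # The update function is re-run if another client wrote concurrently,
        # so two orders never receive the same number.
        return (current or 0) + 1
    # transaction() returns the committed value of the counter.
    return db.reference("counters/orders").transaction(bump)

# Keep the auto-generated push key, but store a readable order number inside.
db.reference("orders").push({
    "order_number": str(next_order_number()),
    "ordered_by": "uid_0",
    "order_amount": "$99.95",
})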

Looking up the existence of a large number of keys (up to 1M) in Datastore

We have a table with 100M rows in Google Cloud Datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when we have a new user and he/she says he/she likes document {a1, a2, ..., an}, we want to tell if all these document ak {k in 1 to n} are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all these document content and use them to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1]. However, to check the existence of every key in a set of 1M, I would still need to issue 1000 requests.
An alternative is to use a customized middle layer to provide a quick look up (for example, can use bloom filter or something similar) to save the time between multiple requests. Assuming we never delete keys, every time we insert a key, we add it through the middle layer. The bloom-filter keeps track of what keys we have (with a tolerable false positive rate). Since this is a custom layer, we could provide a micro-service without a limit. Say we could respond to a request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
Is there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking down the problem in a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
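A minimal sketch of that lookup with the google-cloud-datastore client; hashing the URL to get a bounded-length key name is an extra assumption on my part (the next paragraph mentions the alternative of storing the URL as a property), and launch_crawl_job is a hypothetical hook:

import hashlib
from google.cloud import datastore

client = datastore.Client()

def ensure_crawled(url):
    # Derive a short, stable key name from the URL so the key itself
    # identifies the page and never exceeds key-length limits.
    key = client.key("Document", hashlib.sha256(url.encode("utf-8")).hexdigest())

    if client.get(key) is not None:   # strongly consistent lookup by key
        return True                   # page already known

    entity = datastore.Entity(key=key)
    entity.update({"url": url, "crawled": False})
    client.put(entity)
    # launch_crawl_job(url)  # hypothetical hook into the crawler
    return False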
The length of the entity key can be an issue; see How long (max characters) can a datastore entity key_name be? Is it bad to have very long key_names?. To avoid it you can store the URL as a property of the webpage entity. You'll then have to query for the entity by the url property to determine whether the webpage has already been considered for crawling. This is only eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query results. Not a big deal - it can be addressed by a bit of logic in the crawling job to prevent and/or remove document duplicates.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
if it does just create the like mapping entity
if it doesn't and you used the above-mentioned unique key identifiers:
create the document entity and launch its crawling job
create the like mapping entity
otherwise:
launch the crawling job which creates the document entity taking care of deduplication
launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups - you only do one at a time, which is OK; a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.

Efficient DynamoDB schema for time series data

We are building a conversation system that will support messages between 2 users (and eventually between 3+ users). Each conversation will have a collection of users who can participate/view the conversation as well as a collection of messages. The UI will display the most recent 10 messages in a specific conversation with the ability to "page" (progressive scrolling?) the messages to view messages further back in time.
The plan is to store conversations and the participants in MSSQL and then only store the messages (which represents the data that has the potential to grow very large) in DynamoDB. The message table would use the conversation ID as the hash key and the message CreateDate as the range key. The conversation ID could be anything at this point (integer, GUID, etc) to ensure an even message distribution across the partitions.
In order to avoid hot partitions one suggestion is to create separate tables for time series data because typically only the most recent data will be accessed. Would this lead to issues when we need to pull back previous messages for a user as they scroll/page because we have to query across multiple tables to piece together a batch of messages?
Is there a different/better approach for storing time series data that may be infrequently accessed, but available quickly?
I guess we can assume that there are many "active" conversations in parallel, right? Meaning - we're not dealing with the case where all the traffic is regarding a single conversation (or a few).
If that's the case, and you're using a random number/GUID as your HASH key, your objects will be evenly spread across the nodes and, as far as I know, you shouldn't be afraid of skewness. Since CreateDate is only the RANGE key, all messages for the same conversation will be stored on the same node (based on their ConversationID), so it actually doesn't matter whether you query for the latest 5 records or the earliest 5 - in both cases the query uses the index on CreateDate.
I wouldn't break the data into multiple tables. I don't see what benefit it gives you (considering the previous section) and it will make your administrative life a nightmare (just imagine changing throughput for all tables, or backing them up, or creating a CloudFormation template to create your whole environment).
I would be concerned with the number of messages that will be returned when you pull the history. I guess you'll implement that with a query command using the ConversationID as the HASH key and ordering results by CreateDate descending. In that case, I'd return only the first page of results (I think it returns up to 1 MB of data, so depending on the average message length it may or may not be enough) and fetch the next page only if the user keeps scrolling. Otherwise, you might spend a lot of your throughput on really long conversations, and anyway the client doesn't really want to get stuck for a long time waiting for megabytes of data to appear on screen.
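In boto3 terms, that paged, newest-first query might look roughly like this (the Messages table and attribute names are placeholders matching the layout described in the question):

import boto3
from boto3.dynamodb.conditions import Key

messages = boto3.resource("dynamodb").Table("Messages")  # placeholder name

def latest_messages(conversation_id, page_size=10, start_key=None):
    kwargs = {
        "KeyConditionExpression": Key("ConversationId").eq(conversation_id),
        "ScanIndexForward": False,  # newest first (descending CreateDate)
        "Limit": page_size,         # one screen of messages at a time
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = messages.query(**kwargs)
    # Pass LastEvaluatedKey back in as start_key when the user keeps scrolling.
    return resp["Items"], resp.get("LastEvaluatedKey")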
Hope this helps
