DocumentDB has its strengths. I think most would agree that creating associations between documents is not one of those strengths.
From what I have read, the common strategy is to keep your data as denormalized as possible and to write custom logic to update the denormalized copies when the original changes.
But what if you need your data to be normalized in some places? Let's say I have People and IceCreamFlavors. A Person has a favoriteIceCreamFlavor that is an id referencing an IceCreamFlavor.
From what I understand, if I need to get the IceCreamFlavor for this person, I'll need to run a second query to fetch the IceCreamFlavor.
(single-collection DocumentDB)
SELECT * FROM c WHERE c.id = "person-1"
{
"firstName": "John",
"lastname": "Doe",
"favorityIceCreamFlavor": "icecream-4"
}
Fetch the IceCreamFlavor:
SELECT * FROM c WHERE c.id = "icecream-4"
{
"name": "Chocolate"
}
Combine the objects:
{
"firstName": "John",
"lastname": "Doe",
"favorityIceCreamFlavor": {
"name": "Chocolate"
}
}
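Client-side, the pattern above is just two queries and a merge. Here is a minimal sketch using the current azure-cosmos Python SDK; the endpoint, key, and container names are placeholder assumptions, not part of the original setup:

```python
from azure.cosmos import CosmosClient

# Placeholder connection details; substitute your own account and container.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("db").get_container_client("items")

def query_one(query, params):
    # query_items returns an iterable of matching documents; we expect one here
    results = list(container.query_items(
        query=query, parameters=params, enable_cross_partition_query=True))
    return results[0]

person = query_one("SELECT * FROM c WHERE c.id = @id",
                   [{"name": "@id", "value": "person-1"}])
flavor = query_one("SELECT * FROM c WHERE c.id = @id",
                   [{"name": "@id", "value": person["favoriteIceCreamFlavor"]}])

# Combine the two documents, as in the merged object above
person["favoriteIceCreamFlavor"] = {"name": flavor["name"]}
```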
Obviously not ideal, but if I'm looking at a person's profile this isn't the worst. Additionally, with this flavor of document storage (DocumentDB), I can create stored procedures, so I can do this sub-query server-side.
But what if I'm an administrator and I want to see all of my users and their favorite ice creams?
This is starting to look like a problem. It looks like I have to do 11 sub-queries to fetch each user's ice cream flavor.
This may simply be a problem that document storage cannot handle efficiently. But I'm making that assumption; I don't know how DocumentDB works under the hood.
Should I be concerned about DocumentDB doing a query for each record here in a stored procedure?
Do DocumentDB Subqueries Perform Well Enough to Do RDBMS-Style Joins?
A relational database needs to do two queries to accomplish a join, but both of them can be served from cache, either just for the indexes or in some cases for the entire operation. Also, that work is done in the same memory space, very close to the data, so throughput constraints don't come into play.
DocumentDB/CosmosDB has something very close to that if you can do both queries in a stored procedure. You can only do that if both sets are in the same partition and the queries can complete before the sproc times out (which happens between 5K and 20K docs retrieved on large databases), but if you can use a stored procedure, then you are in the same memory space and very close to the data. The difference in latency between a SQL join and two round trips in a DocumentDB/CosmosDB sproc will be minimal: single-digit milliseconds, in my estimation, on a database of 100K documents where your query only pulls back hundreds of documents.
A couple of other downsides to using sprocs for queries I should mention: 1) It can consume more RUs depending upon the complexity of your join logic, and 2) Sprocs execute in isolation which can limit concurrency and reduce the overall throughput of the system. On the other hand, you get guaranteed ACID consistency even when one of the other less strong consistency models is in effect for non-sproc queries.
If you can't use a sproc for the reasons discussed above, then you'll need to pull the data back across the wire for the first query before composing the second one. In this case, you will run into throughput constraints and additional latency. How much depends upon a lot of parameters. Using an app server in the same data center as the DocumentDB/CosmosDB partition holding the data will keep this to a minimum, but even that comes with a penalty. It may still be only milliseconds of difference, but it will have an effect. If you have to leave the data center with the first round trip before composing the second, the effect will be even greater.
Every application is different, but for typical OLTP traffic, I've always been able to get the performance I've needed out of DocumentDB. Even heavy analytical loads can work, especially if you are careful with partition key choice to get sufficient parallelism. I suggest you give it a try with a simple experiment close to your intended final product and see how it does.
Hope this helps.
I am trying to understand the limitations of DynamoDB/NoSQL, mostly as a learning exercise. I came across a problem that is fairly simple in a relational database, but I cannot figure out how to accomplish it in DynamoDB, even with full control over rebuilding the tables and indexes.
Problem: Every day everyone in an office chooses one fruit for lunch. At the end of the week, I just want a list of everyone who ate both an apple and a banana.
Example Data
I thought employee name should be the PK and day of the week should be the SK, with fruit as an attribute. But that doesn't seem to work, because you can't query against an attribute.
Is there a way to structure the data to make this work? Is there another tool like OpenSearch, HiveQL, GraphQL that can help me do what I am trying to do here?
Thanks.
When you say it's "fairly simple in a relational database", what you mean is that it's simple to express, not necessarily simple to compute. You're pushing a lot of list-intersection work onto the database. As your data set grows, the response time for your query will get slower and slower. At some point the database will no longer be able to give you the answer, and while it's consuming CPU (before timing out), you're negatively impacting the load on the relational database server for other users.
With DynamoDB you can't express queries that take unbounded effort to compute or that depend so much on total data set size for their performance characteristics. You have to design a query system up front that doesn't get exponentially slower as the data set grows.
The DynamoDB design then depends on what you know up front. For example, do you know it's always the intersection of apple and banana? Then during insert of a new fruit, note whether the person has now eaten both, and mark them as such on a user metadata item. Use that marker later during the query phase.
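As a rough illustration of that insert-time marking, here is a hedged boto3 sketch; the table name, key schema, and `ateBoth` marker attribute are all hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("FruitChoices")  # hypothetical table: PK=employee, SK=day

def record_choice(employee: str, day: str, fruit: str) -> None:
    table.put_item(Item={"employee": employee, "day": day, "fruit": fruit})

    # Check whether this person has now eaten both target fruits this week.
    items = table.query(KeyConditionExpression=Key("employee").eq(employee))["Items"]
    fruits = {item["fruit"] for item in items}
    if {"apple", "banana"} <= fruits:
        # Mark a user metadata item; the weekly report then reads only the
        # markers (e.g. via a sparse GSI on ateBoth) instead of intersecting lists.
        table.put_item(Item={"employee": employee, "day": "#META", "ateBoth": True})
```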
Sound like a nuisance? Well, if your data set isn't growing large and/or you don't need reliably fast query performance, then a relational database solves this problem well. Different databases for different purposes.
DynamoDB also supports Scan, not only Query.
A simple design for the table is to have the PK be the name of the person, and the attributes be numeric counts of the fruits that you increment every day:
UPDATE "FRUIT_COUNTS"
SET BANANA=BANANA + 1
WHERE Employee='Bob'
Then, at the end of the week, you can run a simple PartiQL query on the table:
SELECT * FROM "FRUIT_COUNTS"
WHERE BANANA > 0 AND APPLE > 0
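For completeness, here is a hedged sketch of driving both statements from Python with boto3's execute_statement; table and attribute names are as assumed above. Note that the increment requires the item and counter attribute to already exist, and the weekly SELECT filters on non-key attributes, so it is effectively a scan:

```python
import boto3

client = boto3.client("dynamodb")

# Daily increment for the chosen fruit
client.execute_statement(
    Statement='UPDATE "FRUIT_COUNTS" SET BANANA = BANANA + 1 WHERE Employee = ?',
    Parameters=[{"S": "Bob"}],
)

# End-of-week report: everyone who ate at least one apple and one banana
resp = client.execute_statement(
    Statement='SELECT * FROM "FRUIT_COUNTS" WHERE BANANA > 0 AND APPLE > 0'
)
for item in resp["Items"]:
    print(item["Employee"]["S"])
```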
Background
I have to design a table to store announcements in DynamoDB. Each announcement has the following structure:
{
"announcementId": "(For the frontend to identify an announcement to the backend)",
"author": "(id of author)",
"displayStartDatetime": "",
"displayEndDatetime": "",
"title": "",
"description": "",
"image": "(A url to an image)",
"link": "(A single url to another page)"
}
As we are still designing the table, alterations to the structure are permitted. In particular, announcementId, displayStartDatetime and displayEndDatetime can be changed.
The main access pattern is to find the current announcements. Users have a webpage on which they can see all current announcements and their details.
Every announcement has a date for when to start showing it (displayStartDatetime) and when to stop showing it (displayEndDatetime). The announcement should still be kept in the table after the current datetime is past displayEndDatetime, for reference by admins.
The start and end datetime are precise to the minute.
Problem
Ideally, I would like a way to query the table for all the current announcements in one query.
However, I have come to the conclusion that it is impossible to fuse two datetimes into one sort key, because there is no way to order by two pieces of data of equal importance (e.g. storing the timestamps concatenated as a string means one of them will always take precedence over the other).
Hence, as a compromise, I would like to sort the table values by displayEndDatetime so that I can filter out past announcements. This is because, as time goes on, there will be more past announcements than future announcements, so it will be more beneficial to optimise for that.
Compromised Solution
Currently, my (not very good) solutions are:
Use one "hot" partition key and use the displayEndDatetime as the sort key.
This allows me to filter out past announcements, but it also means that all the data is in a single partition. I could run a scheduled job every now and then to move the past announcements to different, spaced-out partitions.
2. Scan through the table
I believe Scan will look at every item in the table before it performs any filtering. This solution doesn't seem as good as 1. but it would be the simplest to implement and it would allow me to keep announcementId as the partition key.
3. Scan a GSI of the table
Since Scan will look through every item, it may be more efficient to create a GSI (announcementId (PK), displayEndDatetime (SK)) and scan through that to retrieve all the announcementIds which have not passed their end datetime. After that, another request could be made to get the full announcements.
Question
What is the most optimised solution for storing all announcements and then finding current announcements when using DynamoDB?
Although I have listed a few possible solutions for sorting the displayEndDatetime, the main point is still finding announcements between the start and end datetime.
Edit
Here are the answers to #tugberk's questions on the background:
What is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
I am uncertain of how the admins will use this system, announcements can be very regular (about 3/day) or very infrequent (about 3/month).
How much new data do you anticipate storing daily, and how do you think this will grow?
As mentioned above, this could be about 3 announcements a day or 3 a month. This is likely to remain the same for as long as I should be concerned about.
What is the rate of reads (e.g. peak reads per second)?
I would expect the peak reads per second to be around 500-1000 reads/s. This number is expected to grow as there are more users.
How many announcements can a user see at a time (i.e. what's the avg/max number of announcements visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most).
I would expect the maximum number of viewable announcements to be up to 30-40. This is because there could be multiple long-running announcements along with short-term announcements. On average, I would expect about 5-10 announcements.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
I think the speed with which the announcement starts showing is important, especially if the admins decide that this is a good platform for urgent announcements (likely urgent to the minute). However, when it stops showing is less important; to avoid confusing the users, the announcement should stop displaying at most 4 hours after it is past its display end datetime.
These types of questions are always hard to answer here, as there are so many assumptions involved and it's really hard to have all the facts. But I will try to give you some ideas, which may help you think about your data storage choice, as well as give you further options.
I know what I am doing, and really need to use DynamoDB
Edited this answer based on the OP's answers to my original questions.
As you really need to use DynamoDB for this for internal reasons, I think it's more suitable to store the data in two DynamoDB tables, one for serving reads and one for writes, as nearly all access patterns I can think of will hit multiple partitions if you have one table. You can get away with a GSI, but it's not too straightforward how to do it, and I am not sure whether there is any advantage to doing it that way.
The core thing you need to optimize for is the reads, as you mentioned they can go up to 2K reads per second, which is big enough to be the part you optimize your architecture against. Based on your assumption of having about 3 announcements a day, the writes are nothing to worry about.
General idea is this:
I would consider using one DynamoDB table to handle writes, where you configure the author identifier as the partition key and the announcement identifier as the sort key (making the primary key the combination of both). This will allow you to query all the announcements for a given author easily.
I would also have a second DynamoDB table to handle reads, where you only store active announcements, which your application can retrieve in full with a Scan (i.e. O(N)). That is not a concern, as you mentioned there will only be 30-40 active announcements at any point in time; let's imagine it were even 500, you would still be OK with this structure. In terms of partition and sort key, I would just have an active field as the partition key, which you always set to "true" (note that a partition key must be a string, number, or binary, not an actual boolean), with the announcement id as the sort key, making the combination of both the primary key. If you care about the ordering of these announcements, you can adjust the sort key accordingly, but make sure it's unique, e.g. by concatenating the display start datetime and the announcement identifier, like {displayBeginDatetime-in-yyyyMMddHHmmss-format}-{announcementId}. This way you guarantee that you will only hit one partition. You could actually simplify this and have the announcement identifier as the partition key and primary key, as I am nearly sure that DynamoDB would store all your data in one partition since it's going to be so small; better to confirm this though, as I am not 100% sure. The point here is that you are much better off ensuring this query hits a single partition.
Here is how this may work, where there are some edge cases I am overlooking:
record the write for an announcement in the first DynamoDB table. When an announcement is written, configure displayEndDatetime as the TTL of that row, with the assumption that you don't need this record in this table once the announcement expires.
have a job running every N minutes (one or more, depending on the data inconsistency gap you can handle), which Scans the entire first DynamoDB table across partitions (in a paginated way) and decides which announcements are currently visible. It then writes that data into the second DynamoDB table, which handles the reads, in the structure established above, so that your consumer can read from it without worrying about any filtering, as the data is already filtered (i.e. all the announcements there are visible ones). A Scan is fine here because you run it once every N minutes, with the assumption that you are OK with at least 1 minute + processing time of data inconsistency gap. I would suggest running this every 10 minutes or so if you don't have strong data consistency requirements. There is a sketch of this job after this list.
On the read storage system, also configure displayEndDatetime as the TTL for the row so that it gets automatically deleted.
Configure DynamoDB Streams on the first DynamoDB table, which has 24-hour retention and an exactly-once delivery guarantee, and have a Lambda consumer of this stream to handle item deletions (which happen when TTL kicks in for a particular row). It can keep a record of these announcements somewhere else for longer retention, where you can expose them through a different access pattern (e.g. show all the announcements per author so that they can re-enable old announcements), as you mentioned in your question. You can configure Lambda event sourcing with DynamoDB Streams, which allows you to handle failures with retries, etc. Make sure your logic in these Lambdas is idempotent so that you can retry safely.
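Here is a hedged boto3 sketch of the sync job and the read path described above; all table, key, and attribute names are illustrative assumptions, and the datetimes are assumed to be stored as epoch seconds (which is also what DynamoDB TTL expects):

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
writes_table = dynamodb.Table("AnnouncementsWrites")  # PK=author, SK=announcementId
reads_table = dynamodb.Table("AnnouncementsReads")    # PK=active, SK=announcementId

def sync_visible_announcements() -> None:
    now = int(time.time())
    kwargs = {}
    while True:  # paginated Scan across all partitions of the writes table
        page = writes_table.scan(**kwargs)
        for item in page["Items"]:
            if item["displayStartDatetime"] <= now <= item["displayEndDatetime"]:
                reads_table.put_item(Item={
                    "active": "true",  # constant partition key (a string, not a boolean)
                    "announcementId": item["announcementId"],
                    "title": item["title"],
                    "description": item["description"],
                    "ttl": item["displayEndDatetime"],  # TTL attribute on the read table
                })
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# Read path: a single-partition Query returning all currently visible announcements
def get_visible_announcements():
    return reads_table.query(KeyConditionExpression=Key("active").eq("true"))["Items"]
```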
The sections below are from my original answer. They are still relevant to anyone trying to achieve something similar, so I will leave them here, but they are less relevant now that the OP needs to use DynamoDB.
Why DynamoDB?
First of all, I would question why you need DynamoDB for this, as your requirements seem more read-heavy than write-heavy, and write-heavy workloads are where I think DynamoDB shines the most, due to its out-of-the-box partitioned nature.
The questions below would help you understand whether you really need DynamoDB for this, or whether you can get away with a more flexible data storage system:
what is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
how much new data do you anticipate storing daily, and how do you think this will grow?
what is the rate of reads (e.g. peak reads per second)?
How many announcements can a user see at a time (i.e. what's the avg/max number of announcements visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most). This will help you understand whether you will be OK pulling all the visible announcements in one go, or whether you need a pagination system.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
Actually, I don't need DynamoDB
Based on my assumptions about your consumption and admin needs for this use case, I believe you don't need DynamoDB, on the assumption that you don't have a high number of writes (which might be wrong); if these assumptions are correct, the above is a super over-engineered solution for you. Assuming they are, I think you are better off using PostgreSQL, which gives you the ability to change your access patterns as you see fit with further indexing, and for your current access pattern you can use a range query over the start and end times.
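For reference, the range query mentioned above could look something like this in Python with psycopg2; the table and column names are hypothetical:

```python
import psycopg2

conn = psycopg2.connect("dbname=announcements")  # placeholder DSN
with conn.cursor() as cur:
    # An index on the two datetime columns keeps this fast as the table grows.
    cur.execute(
        """
        SELECT announcement_id, title, description
        FROM announcements
        WHERE now() BETWEEN display_start_datetime AND display_end_datetime
        """
    )
    current_announcements = cur.fetchall()
```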
Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe it opens up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea, in a nutshell, is to avoid the overhead of doing joins on the database layer, or of having to go back to the database to effectively do the join on the application layer. By having the data already sliced in the format your application requires, all you really need to do is one select * from table where x = y call, which returns multiple entities in one go (in your example that could be Users and Documents). This means it will be extremely efficient and scalable at the db level. But it also means you'll be less flexible, as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
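As a hedged illustration of that single-call pattern, in the style of the linked best-practice docs (the table name and key conventions here are hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("AppTable")  # assumed single-table design

# One Query returns the user profile item and all of their document items,
# because both entity types share the same partition key.
resp = table.query(KeyConditionExpression=Key("pk").eq("USER#123"))

user, documents = None, []
for item in resp["Items"]:
    if item["sk"] == "PROFILE":
        user = item
    elif item["sk"].startswith("DOC#"):
        documents.append(item)
```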
I don't think it has any performance benefits, at least none that are called out anywhere, which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be cost of storage and provisioned throughput.
Apart from that, I'm not sure, given the new limit of 20 GSIs per table.
I am new to NoSQL data modelling, so please excuse me if my question is trivial. One piece of advice I found for DynamoDB is to always supply a partition key while querying; otherwise, it will scan the whole table. But there could be cases where we need to list our items, for instance on an e-commerce website where we need to list our products on a list page (with pagination).
How should we perform this listing while avoiding Scan, or using it efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.)
And that's it. As you can see, you should always prefer GetItem (or BatchGetItem) to Query, and Query to Scan.
You can use queries if you add a sort key to your data. E.g. you can use category as the hash key and product name as the sort key, so that a page showing items for a particular category can query by that category and product name. But that design is fragile, as you may need other keys for other pages; for example, you may need a vendor + price query if the user is looking for a particular mobile phone. Indexes can help here, but they come with their own tradeoffs and limitations.
Moreover, filtering by arbitrary expressions is applied after the query/scan operation completes but before you get the results, so you're charged for the whole query/scan. It's literally like filtering the data yourself in the application rather than on the database side.
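To make the distinction concrete, here is a hedged boto3 sketch of the three read paths and the post-read filtering just described; the table, keys, and attributes are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource("dynamodb")
products = dynamodb.Table("Products")  # assumed PK=category, SK=name

# GetItem: direct lookup by the full primary key
item = products.get_item(Key={"category": "phones", "name": "Acme X1"}).get("Item")

# Query: all items in one partition, optionally narrowed by the sort key
page = products.query(
    KeyConditionExpression=Key("category").eq("phones") & Key("name").begins_with("Acme")
)

# Scan with a filter: reads the entire table; the filter is applied after the
# read, so you pay for every item scanned, not just the ones returned.
cheap_phones = products.scan(FilterExpression=Attr("price").lt(300))
```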
I would say that DynamoDB is just not intended for many kinds of workloads. Probably it's not suited for your case either. Think of it as a rich key-value (key-to-object) store, and not a "classic" RDBMS, where indexes come at a lower cost, with fewer limitations, and provide developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB; take a look. It contains an awesome decision tree that guides you through the DynamoDB decision. I'm pasting it here, but please note that the original author is Forrest Brazeal.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB usecases and issues.
P.S. There is nothing criminal in doing scans (I actually do them on a schedule, once per day, in one of my projects), but that's an exceptional case, and I regret the decision to use DynamoDB there. It's not efficient in terms of speed, money, support, and "dirtiness". I had to increase the capacity before the job and reduce it after, but that's another story…
I have a single collection in Cosmos DB where documents are of two separate types. Let's call them board and pin.
Board:
{
"id": "board-1",
"description": "A collection of nice pins",
"author": "user-a",
"moments": [
{
"id": "pin-1"
},
{
"id": "pin-2"
},
{
"id": "pin-3"
}
]
}
Pin:
{
"id": "pin-1",
"description": "Number 1 is the best pin",
"author": "user-b"
}
I know how to query just a board or a pin based on its id. But I need a query (given the id of the board) which returns all the pins contained in that board. It would also be good if I could filter out one or more fields of the pins.
Example: Not returning the author to the client.
{
"id": "pin-1",
"description": "Number 1 is the best pin"
},
{
"id": "pin-2",
"description": "Number 2 is very funny"
}..etc
I know I could handle this logic in the client app by making two requests, but is it possible to write a query for Cosmos DB that handles this?
Short answer: no, currently you cannot join different documents in a single SQL query.
DocumentDB is schemaless, and there is no hard concept of "references" like in the relational DB world. The referencing ids you have in your documents are just regular string data to DocumentDB; their special meaning (of linking to other documents) exists only in your application. Querying is currently just finding documents, or parts of a document, by some given predicates, and it is carried out on documents independently of each other.
As a sidenote: I imagine this is by design, as such a restriction enables parallelism and probably contributes to the low latency they aim to deliver.
This does not mean that what you need is impossible. Options to consider:
Option 1: reference redesign
If you had a data design where the board-pin relationship data were stored on the pin side, then you could query all pins in board-1 with a single query, along the lines of:
SELECT * FROM pin WHERE pin.boardId = @boardId
It's quite common to denormalize your data model to some extent to optimize RU usage. Sometimes it is beneficial to duplicate some parent information into referencing documents, or even to store the relationship on both ends if the data is not too volatile and is read a lot from both sides. As a downside, keeping the data in sync on writes becomes a bit more complicated. Mmmm, tradeoffs...
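With that redesign, the lookup from the question becomes one parameterized query, which can also project away fields like author. A hedged sketch with the current azure-cosmos Python SDK (connection details are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("db").get_container_client("pins")

# SELECT only id and description, so the author field never leaves the server
pins = container.query_items(
    query="SELECT p.id, p.description FROM p WHERE p.boardId = @boardId",
    parameters=[{"name": "@boardId", "value": "board-1"}],
    enable_cross_partition_query=True,
)
for pin in pins:
    print(pin)
```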
If redesign is an option then see the talk Modeling Data for NoSQL Document Databases from //build/2016 by Ryan CrawCour and David Makogon. It should give you some ideas to consider.
When designing data for DocumentDB, keep in mind that storage is relatively cheap; processing power (RU) is what you pay for.
Option 2: stored procedures
If you want/need to optimize storage/latency and cannot modify the data design, and really-really need to perform such a query in a single round trip, then you could build a stored procedure to do the queries server-side and pack the results into a single returned JSON message within DocumentDB.
See Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs for more detail about what can be done and how.
I imagine you may get slightly better latency (due to the single call) and slightly worse overall RU usage (extra work for SP execution, transactions, merging the results), but definitely test your case before committing.
I consider this option a bit dirty as:
1. combining documents to serve higher-level needs is application logic, and hence belongs in your application layer, not in the database;
2. JS in DocumentDB is more cumbersome to develop, debug and maintain.
Option 3: change nothing
.. and just do the 2 calls. It's simple, and may just as likely end up being the best solution in the long run (considering the overall cost of design, development, maintenance, changes, etc.).