How do I query a collection in Cosmos DB in two steps? - azure-cosmosdb

I have a single collection in Cosmos DB where documents are separated into two types. Let's call them board and pin.
Board:
{
  "id": "board-1",
  "description": "A collection of nice pins",
  "author": "user-a",
  "moments": [
    { "id": "pin-1" },
    { "id": "pin-2" },
    { "id": "pin-3" }
  ]
}
Pin:
{
  "id": "pin-1",
  "description": "Number 1 is the best pin",
  "author": "user-b"
}
I know how to query just a board or a pin based on its id. But I need to write a query that, based on the id of the board, gives me all the pins contained in that board. It would also be good if I could filter out one or more parts of the pins.
Example: Not returning the author to the client.
{
  "id": "pin-1",
  "description": "Number 1 is the best pin"
},
{
  "id": "pin-2",
  "description": "Number 2 is very funny"
}
...etc
I know I could handle this logic in the client app by making two requests, but is it possible to write a query for Cosmos DB that handles this?

Short answer: No, currently you cannot join different documents in a single SQL query.
DocumentDB is schemaless and there is no hard concept of "references" like in the relational DB world. The referencing ids you have in your documents are just regular string data to DocumentDB; their special meaning (linking to other documents) exists only in your application. Querying is currently just finding documents, or parts of a document, by given predicates, and it is carried out on each document independently of the others.
As a side note: I imagine this is by design, since such a restriction enables parallelism and probably contributes to the low latency the service is intended to deliver.
This does not mean that what you need is impossible. Options to consider:
Option 1: reference redesign
If you had a data design where the board-pin relationship was stored on the pin side, then you could query all pins in board-1 with a single query, along the lines of:
SELECT * FROM pin WHERE pin.boardId = @boardId
It's quite common that you would need to denormalize your data model to some extent to optimize RU usage. Sometimes it is beneficial to duplicate some parent information into the referencing documents, or even to store the relationship on both ends if the data is not too volatile and is read a lot from both sides. As a downside, keeping the data in sync on writes becomes a bit more complicated. Mmmm, tradeoffs...
If redesign is an option then see the talk Modeling Data for NoSQL Document Databases from //build/2016 by Ryan CrawCour and David Makogon. It should give you some ideas to consider.
When designing data for DocumentDB, keep in mind that storage is relatively cheap; processing power (RUs) is what you pay for.
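A minimal sketch of what Option 1 could look like with the JavaScript SDK (@azure/cosmos). The endpoint/key variables, the database and container names, and the boardId property on each pin document are assumptions, not part of the question:

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!, // assumed environment variables
  key: process.env.COSMOS_KEY!,
});
const container = client.database("pinboard").container("items"); // assumed names

// With the relationship stored on the pin side (pin.boardId), one query returns
// every pin of a board; the projection also leaves the author out of the result.
async function getPinsForBoard(boardId: string) {
  const { resources: pins } = await container.items
    .query({
      query: "SELECT p.id, p.description FROM p WHERE p.boardId = @boardId",
      parameters: [{ name: "@boardId", value: boardId }],
    })
    .fetchAll();
  return pins;
}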
Option 2: stored procedures
If you want or need to optimize storage/latency, cannot modify the data design, and really need to perform such a query in a single round trip, then you could build a stored procedure that runs the queries server-side and packs the results into a single returned JSON message.
See Azure Cosmos DB server-side programming: Stored procedures, database triggers, and UDFs for more detail about what can be done and how.
I imagine you may get slightly better latency (due to the single call) and slightly worse overall RU usage (extra work for the SP execution, the transaction, and merging the results), but definitely test your case before committing.
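A minimal sketch of such a stored procedure, written against the server-side JavaScript API (getContext, queryDocuments). The board/pin shapes follow the question; everything else, including the assumption that a board and its pins live in the same partition, is illustrative only:

// Cosmos DB stored procedures are plain JavaScript; the declare line only keeps
// the TypeScript compiler happy, the sproc runtime provides getContext().
declare function getContext(): any;

function getBoardWithPins(boardId: string) {
  const collection = getContext().getCollection();
  const response = getContext().getResponse();

  // 1st query: the board document itself.
  const accepted = collection.queryDocuments(
    collection.getSelfLink(),
    { query: "SELECT * FROM c WHERE c.id = @id", parameters: [{ name: "@id", value: boardId }] },
    {},
    (err: any, boards: any[]) => {
      if (err) throw err;
      const board = boards[0];
      const ids = board.moments.map((m: any) => m.id);

      // 2nd query: the referenced pins, with the author projected away.
      collection.queryDocuments(
        collection.getSelfLink(),
        { query: "SELECT p.id, p.description FROM p WHERE ARRAY_CONTAINS(@ids, p.id)", parameters: [{ name: "@ids", value: ids }] },
        {},
        (err2: any, pins: any[]) => {
          if (err2) throw err2;
          // Pack both results into a single returned JSON message.
          response.setBody({ board: board, pins: pins });
        }
      );
    }
  );
  if (!accepted) throw new Error("Query was not accepted (procedure is about to time out).");
}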
I consider this option a bit dirty because:
1. Combining documents to satisfy higher-level needs is application logic and hence belongs in your application layer, not in the database.
2. JS in DocumentDB is more cumbersome to develop, debug and maintain.
Option 3: change nothing
.. and just do the 2 calls. It's simple and may just as likely end up being the best solution in the long run (considering the overall cost of design, development, maintenance, changes, etc.).

Related

Resolve FK in Firestore

I have some documents in Firestore with some fields in them. For example, the collection "details" looks like this:
{
  id: "",
  fields1: "",
  userFK: Reference to users collection
}
Now I need to resolve userFK on the fly, meaning I don't want to first fetch all the documents and then query each userFK with userFK.get().
Is there any method for this? It would be like doing a $lookup, which is supported in MongoDB.
In some cases I also want to fetch documents from the "details" collection based on specific fields in users.
There is no way to get documents of multiple types from Firestore with a single read operation. To get the user document referenced by userFK you will have to perform a separate read operation.
This is normal when using NoSQL databases like Cloud Firestore, as they typically don't support any server-side equivalent of a SQL JOIN statement. The performance of loading these additional details is not as bad as you may think though, so be sure to measure how long it takes for your use-case before writing it off as not feasible.
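To make the cost concrete: resolving the reference client-side is one extra getDoc() call per details document. A sketch with the modular web SDK; the collection and field names mirror the question, and everything else is assumed:

import { getFirestore, collection, getDocs, getDoc, DocumentReference } from "firebase/firestore";

const db = getFirestore();

async function loadDetailsWithUsers() {
  // First read: the details documents themselves.
  const details = await getDocs(collection(db, "details"));

  // Second read(s): resolve each userFK reference on the fly.
  return Promise.all(
    details.docs.map(async (d) => {
      const data = d.data();
      const userSnap = await getDoc(data.userFK as DocumentReference);
      return { id: d.id, ...data, user: userSnap.data() };
    })
  );
}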
If this additional load is prohibitive for a scenario, an alternative is to duplicate the necessary data of the user into each details document. So instead of only storing the reference to their document, you'd for example also store the user name.
This puts more work on the write operation, but makes the read operations simpler and more scalable. This is the common space-vs-time trade-off, and in NoSQL databases you'll often find yourself trading space for time: storing duplicate data so reads stay fast.
If you're new to NoSQL data modeling, I highly recommend:
NoSQL data modeling
Getting to know Cloud Firestore

Does DynamoDB GSI overloading give performance benefits or just flexibility?

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to not have the overhead of doing joins on the database layer or having to go back to the database to effectively try to do the join on the application layer. By having the data sliced already in the format that your application requires, all you really need to do is basically do one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This means that it will be extremely efficient and scalable on the db level. But also means that you'll be less flexible as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
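In code, the "already joined" read is nothing more than a single Query against the partition key; the overloading only shows up in how the keys were written. A sketch with the DocumentClient, where the table name and the pk/sk attribute conventions are assumptions:

import { DynamoDB } from "aws-sdk";

const dc = new DynamoDB.DocumentClient();

// One call returns a user's published documents (ordered by the sort key),
// because they were all written under the same partition key.
async function getPublishedDocs(userId: string) {
  const res = await dc
    .query({
      TableName: "app-table", // assumed table name
      KeyConditionExpression: "pk = :pk AND begins_with(sk, :prefix)",
      ExpressionAttributeValues: {
        ":pk": `USER#${userId}`,
        ":prefix": "PUBLISHED#", // assumed sort-key convention, e.g. PUBLISHED#2020-01-01
      },
    })
    .promise();
  return res.Items;
}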
I don't think it has any performance benefits, at least none that's not called out – which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be cost of storage and provisioned throughput.
Apart from that, I'm not sure, given the new limit of 20 GSIs per table.

Efficient way of returning an item list from Azure Cosmos DB

I want to store data of the following form in Azure Cosmos DB:
{
  "id": "guid",
  "name": "a name",
  "tenantId": "guid",
  "filter1": true,
  "filter2": false,
  "hierarchicalData": {}
}
Each document will be up to a few megabytes in size.
I need to be able to return an {id, name} list (100 < count < 10k, per tenant) for a given search by {tenantId, filter1, filter2}.
From the documentation, I see I can do an SQL query with a projection, but am not sure if there is a better way.
Is there an ideal way to do the above while making efficient use of RUs?
Is there an ideal way to do the above while making efficient use of RUs?
It's hard to say that there is one best way to make efficient use of RUs and improve query performance.
Based on your situation, you could of course use a SQL query with a projection to get the data with those filters. Here are several ways to improve your query performance (a query sketch follows the list):
1. Add a partition key.
If your data is partitioned and you provide the partition key with the SQL query, only that partition is scanned, which saves RUs. Please refer to the document.
2. Use a recent SDK.
The Azure Cosmos DB SDKs are constantly being improved to provide the best performance. See the Azure Cosmos DB SDK pages to determine the most recent SDK and review improvements.
3. Exclude unused paths from indexing for faster writes.
Cosmos DB's indexing policy allows you to specify which document paths to include in or exclude from indexing by leveraging indexing paths (IndexingPolicy.IncludedPaths and IndexingPolicy.ExcludedPaths). Using indexing paths can offer improved write performance and lower index storage for scenarios in which the query patterns are known beforehand.
4. Use a continuation token if the result set is too large.
Page through the data with a continuation token to improve query performance. Doc:
https://www.kevinkuszyk.com/2016/08/19/paging-through-query-results-in-azure-documentdb/
For more details, please refer to here.
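Putting the points above together, a projection query scoped to the tenant's partition and read page by page could look like the following sketch (JavaScript SDK; the container name and the choice of /tenantId as partition key are assumptions):

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!, // assumed environment variables
  key: process.env.COSMOS_KEY!,
});
const container = client.database("app").container("items"); // assumed names

async function listIdAndName(tenantId: string) {
  const iterator = container.items.query(
    {
      // The projection keeps the multi-megabyte hierarchicalData out of the response.
      query:
        "SELECT c.id, c.name FROM c WHERE c.tenantId = @tenantId AND c.filter1 = true AND c.filter2 = false",
      parameters: [{ name: "@tenantId", value: tenantId }],
    },
    { partitionKey: tenantId, maxItemCount: 1000 } // stay within one partition, page the results
  );

  const results: { id: string; name: string }[] = [];
  while (iterator.hasMoreResults()) {
    const page = await iterator.fetchNext(); // continuation handled by the iterator
    results.push(...(page.resources as { id: string; name: string }[]));
  }
  return results;
}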

Do DocumentDB Subqueries Perform Well Enough to Do RDBMS-Style Joins?

DocumentDB has its strengths. I think most would agree that creating associations between documents is not one of those strengths.
From what I read, the common strategy is to keep your data as denormalized as possible and to write custom logic that updates the denormalized data when the original changes.
But what if you need your data to be normalized in some places? Let's say I have People and IceCreamFlavors. A Person has a favorityIceCreamFlavor that is an id referencing an IceCreamFlavor.
From what I understand, if I need to get the IceCreamFlavor for this person, I'll need to run a second query to fetch it.
(single-collection DocumentDB)
SELECT * FROM c WHERE c.id = "person-1"
{
  "firstName": "John",
  "lastname": "Doe",
  "favorityIceCreamFlavor": "icecream-4"
}
Fetch the IceCreamFlavor:
SELECT * FROM c WHERE c.id = "icecream-4"
{
  "name": "Chocolate"
}
Combine the objects:
{
  "firstName": "John",
  "lastname": "Doe",
  "favorityIceCreamFlavor": {
    "name": "Chocolate"
  }
}
Obviously not ideal, but if I'm looking at a person's profile this isn't the worst. Additionally, with this flavor of document storage (DocumentDB), I can create stored procedures, so I can do this sub-query server-side.
But what if I'm an administrator and I want to see all of my users and their favorite ice creams?
This is starting to look like a problem. It looks like I have to do 11 sub-queries to fetch each user's ice cream flavor.
This may simply be a problem that document storage cannot handle efficiently. But I'm making that assumption; I don't know how DocumentDB works under the hood.
Should I be concerned about DocumentDB doing a query for each record here in a stored procedure?
Do DocumentDB Subqueries Perform Well Enough to Do RDBMS-Style Joins?
A relational database also needs to do two lookups to accomplish a join. Both of them can be served from cache, either just the indexes or in some cases the entire operation. Also, that work is done in the same memory space, very close to the data, in such a way that throughput constraints don't come into play.
DocumentDB/CosmosDB has something very close to that if you can do both queries in a stored procedure. You can only do that if both sets are in the same partition and the queries can complete before the sproc times out (which happens somewhere between 5K and 20K documents retrieved on large databases), but if you can use a stored procedure, then you are in the same memory space and very close to the data. In my estimation, the difference in latency between a SQL join and two round trips in a DocumentDB/CosmosDB sproc will be minimal: single-digit milliseconds on a database of 100K documents where your query only pulls back hundreds of documents.
A couple of other downsides to using sprocs for queries I should mention: 1) It can consume more RUs depending upon the complexity of your join logic, and 2) Sprocs execute in isolation which can limit concurrency and reduce the overall throughput of the system. On the other hand, you get guaranteed ACID consistency even when one of the other less strong consistency models is in effect for non-sproc queries.
If you can't use a sproc because of the reasons discussed above, then you'll need to pull the data back across the wire for the first query before composing the second one. In this case, you will run into throughput constraints and additional latency. How much depends upon a lot of parameters. Using an app server in the same data center as the DocumentDB/CosmosDB partition holding the data will keep this to a minimum, but even that comes with a penalty. It may still be only a difference of milliseconds, but it will have an effect. If you have to leave the data center with the first round trip before composing the second, the effect will be even greater.
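For the administrator view, the client-side version does not need 11 sub-queries either: one query pulls the people, a second pulls all distinct flavors in a batch, and the "join" happens in memory. A sketch (property names follow the example above, everything else is assumed):

import { Container } from "@azure/cosmos";

// Two round trips instead of one per person: batch the flavor lookup, then join in memory.
async function loadPeopleWithFlavors(container: Container) {
  const { resources: people } = await container.items
    .query("SELECT * FROM c WHERE IS_DEFINED(c.favorityIceCreamFlavor)")
    .fetchAll();

  const flavorIds = Array.from(new Set(people.map((p: any) => p.favorityIceCreamFlavor)));
  const { resources: flavors } = await container.items
    .query({
      query: "SELECT * FROM c WHERE ARRAY_CONTAINS(@ids, c.id)",
      parameters: [{ name: "@ids", value: flavorIds }],
    })
    .fetchAll();

  // In-memory "join": replace the id with the referenced flavor document.
  const byId = new Map(flavors.map((f: any) => [f.id, f] as [string, any]));
  return people.map((p: any) => ({ ...p, favorityIceCreamFlavor: byId.get(p.favorityIceCreamFlavor) }));
}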
Every application is different, but for typical OLTP traffic I've always been able to get the performance I needed out of DocumentDB. Even heavy analytical loads can work, especially if you are careful with partition key choice to get sufficient parallelism. I suggest you give it a try with a simple experiment close to your intended final product and see how it does.
Hope this helps.

Firebase: server-side logic and real-time database limitations

Server side custom operations equivalent to Parse cloud code:
Parse has the possibility to write cloud code. From my understanding of it, Firebase doesn't offer any tools to do so in the console.
The only way to do so would be to implement a web service using the Firebase API, monitor node changes, and implement the cloud code on my own server.
A - Is this correct?
Server side rules:
The legacy documentation of Firebase describes rules which seem to be limited to deciding which user can read/write as well as validation.
{
  "rules": {
    "foo": {
      // /foo/ is readable by the world
      ".read": true,
      // /foo/ is writable by the world
      ".write": true,
      // data written to /foo/ must be a string less than 100 characters
      ".validate": "newData.isString() && newData.val().length < 100"
    }
  }
}
On Parse the complexity of the rules is greater. The programmer is able to create functions to perform custom operations.
Understanding the reason why Firebase is designed as it is:
I imagine that the reason for not having this complexity in Firebase is that a node-based database is probably more complex than a table-based one, and it is better for the developer to keep full control of this by using the web API and a custom server app.
B - Is this correct?
Real time database limitations:
The main limitation when using a real-time database like Firebase seems to me to be this: once you fetch some real-time data, if the data contains a two-way redundancy, the events triggered on one node are not propagated to the node containing the redundant information.
E.g. a user node has keys id (ids of a different node at the same level as the user node), and I display the list of keys that a user has in a table view. In order to detect whether the key list has changed, I need to listen to changes in the keys node (and not only to changes in the user node).
C - Is this correct?
The question is a tad vague as there are no use cases but based on the comments, here you go.
A) Yes, Maybe.
Yes, there is no server side logic (code-wise).
Maybe, it depends on what you are trying to do.
B) Firebase rules are very flexible; rules can limit who can access data, read/write access, what kind of data, type of data, location of data, etc. They are neither more nor less complex than those of a 'table based' database. It's just a different way to verify, validate and store your data.
Just an FYI: Parse was backed by MongoDB, which is a document-based NoSQL database (it's not table-based). On the back end, Parse data was stored in a way similar to Firebase (it's actually BSON). Their front-end implementation was objects that wrapped around JSON structures, which gave the feeling that it was more table-like than Firebase, and that led to the direct ability to have relationships between PFObjects.
C) No. You can observe any node for changes. In your example, you should probably have the keys node as a separate node from the /user node and have users observe the /keys node.
To expand on that a bit, data doesn't necessarily need to be redundant, but it can be. You only need to observe changes for the specific data you are interested in.
Suppose you have a /users node and a /favorite_food node for each user. If your UI is displaying a list of users and one of them changes their favorite food, you don't really care - you are just interested in the list of users. So you would observe the /users node and not the /favorite_food node.
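To put answer C in code: you simply attach the listener to the node your UI actually renders. A sketch with the current web SDK (the question predates it, so treat the exact API as an assumption); the paths are the hypothetical /users and /favorite_food nodes from above, and renderUserList is a placeholder UI function:

import { getDatabase, ref, onValue } from "firebase/database";

const db = getDatabase();

// The UI only lists users, so it only listens to /users.
// Writes under /favorite_food do not trigger this callback.
onValue(ref(db, "users"), (snapshot) => {
  renderUserList(snapshot.val()); // plain object keyed by user id
});

// Hypothetical UI function: here it just logs the user ids.
function renderUserList(users: Record<string, unknown> | null) {
  console.log(Object.keys(users ?? {}));
}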
