How to query from two containers in Cosmos DB (SQL API)

I am new to Cosmos DB. I chose Cosmos DB (Core SQL) and created a database with two containers, say EmployeeContainer and DepartmentContainer. Now I want to query these two containers and fetch employee details along with the associated department details. I am stuck at this point and need help.
Below is the structure of my containers.
EmployeeContainer : ID, Name, DepartmentID
DepartmentContainer: ID, Name
Thanks in advance.

Cosmos DB is not a relational database. Entities that are queried together do not belong in different containers. They are either embedded in other entities, or stored as separate rows in the same container using a partition key shared with the other entities.
Before you get too far with Cosmos you need to understand how to model and partition data to ensure the best possible performance. I strongly recommend you read the docs on partitioning and specifically read these docs below.
Data modeling in Cosmos DB
Partitioning in Cosmos DB
How to model and partition data - a real world example
And watch Data Modeling in Cosmos DB - What every relational developer should know

It completely depends on the type of data you are trying to model. Generally, it comes down to relationships. 1:1 or 1:few relationships are often best served by embedding, where related items are queried or updated together. 1:many or many:many relationships are better served by referencing, where related items are queried or updated independently.
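For example, applying the embedding approach to the employee/department question above, the department could live inside the employee document. This is only a sketch; the document shape and field names are illustrative, not from the question:

```sql
-- Hypothetical employee document with the department embedded:
-- {
--   "id": "e1",
--   "name": "Jane",
--   "department": { "id": "d1", "name": "Engineering" }
-- }
-- A single query then returns the employee together with its department,
-- with no cross-container lookup:
SELECT e.name, e.department.name AS departmentName
FROM Employees e
WHERE e.id = "e1"
```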
For great talks on these issues check out https://www.gotcosmos.com/conf/ondemand

You can use a subquery:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-subquery#mimic-join-with-external-reference-data
But this may consume a lot of RUs, and only inner joins are supported for now.
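As a rough sketch of the pattern that doc describes, you can inline the reference data as an array literal in a subquery and iterate it with JOIN ... IN. The container and field names below are assumptions based on the question, and the department values are made up:

```sql
SELECT e.id, e.Name, d.name AS departmentName
FROM Employees e
JOIN d IN (
    SELECT VALUE [
        { "id": "1", "name": "Engineering" },
        { "id": "2", "name": "Sales" }
    ]
)
WHERE d.id = e.DepartmentID
```

Note the reference data lives in the query itself, not in DepartmentContainer; a query can only ever touch one container.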

Related

Is there a way to see what a Cosmos DB Gremlin API call looks like under the hood?

If I write: g.V().has("person","name","Diego");
What atom-record-sequence calls are used to make this query work?
Is there a way to directly query the atom-record-sequence in a Cosmos DB?
According to this MS doc: "The core type system of Azure Cosmos DB's database engine is atom-record-sequence (ARS) based. Atoms consist of a small set of primitive types, e.g. string, bool, number, etc.; records are structs; and sequences are arrays consisting of atoms, records, or sequences. The database engine of Azure Cosmos DB is capable of efficiently translating and projecting the data models onto the ARS-based data model."
The engine is agnostic to the concept of a schema, blurring the boundary between the structure and instance values of records. Cosmos DB achieves full schema agnosticism by automatically indexing everything upon ingestion in an efficient manner, which allows users to query their globally distributed data without having to deal with schema or index management.
The article below explains this pretty well
https://learn.microsoft.com/en-us/azure/cosmos-db/global-dist-under-the-hood

Best Practices for storing a List of Values set using a NoSQL Database

I am working on a solution that uses a NoSQL backend. My experience is traditionally with relational databases, and I would like to discuss the best way to store a list of values which may appear in a drop-down in the UI. Traditionally, I would just create a table in my relational DB to store that small set of values, and then my records would tie to a specific id representing that value. A simple example of this is a Person table with all of my person records, and a hair color list-of-values table with all the possible hair colors. For each person, a hair color id from that list-of-values table would be stored in the person record. So, a traditional foreign key relationship.
Most of these drop downs are not huge they are small sets (10s of fields) so storing them in their own container within Cosmos seems like overkill. I thought I could also set these values as constants in my API model and manage the values that way. However if those values change I need to do a new build of the API in order to expose those values.
Any thoughts on best practices for how to handle in the NoSQL space? Create a container in the NoSQL backend with the list of values, store the values as constants within my API model or something else?
Appreciate your time considering this question.
In these scenarios for reference data for UI elements, I typically recommend storing all of this data in a single Cosmos container. Cosmos is schema agnostic, so you can mix and match different schemas of data. If the data is <10GB, use a dummy partition key (i.e. /pk) with a single value, and use a "type" property to distinguish among the different entity types for the data that match your UI elements. Fetch the data using a single query on the pk, then deserialize it into POCOs (or whatever hydrates your UI), using the type property to distinguish the different UI elements.
You can store this in a container that is part of a shared database. Minimum RU would be 100 RU/s with four containers in a database or 400 RU/s for dedicated container throughput. Which one you choose will depend on how much RU/s the query that fetches this data costs. You can easily get that by running the query in the Azure portal and looking at the query stats.
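To make the single-container idea concrete, here is a sketch. The document shapes and the "ref" partition key value are illustrative assumptions, not prescribed names:

```sql
-- Documents of different types share one container and one dummy pk:
-- { "id": "1", "pk": "ref", "type": "hairColor", "values": ["Brown", "Black", "Blonde"] }
-- { "id": "2", "pk": "ref", "type": "country",   "values": ["US", "CA", "MX"] }
--
-- One single-partition query fetches everything needed to hydrate the UI;
-- the client then switches on c.type to route each document to its drop-down:
SELECT * FROM c WHERE c.pk = "ref"
```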

Should CosmosDB be modeled like a document database or a graph database?

I see that a CosmosDb can support both graph queries as well as more traditional SQL-like queries - however, I'm a bit confused about what kind of underlying schema is best at the collections level. If I were to model something in MongoDb, SQL Server, or Neo4j, I would have very different schemas. Also - it seems like I can query using more traditional SQL-like syntax - which makes it confusing to know what's right or efficient underneath. Sometimes, just because something is easy to query does not mean the query is efficient.
Is CosmosDb at its heart a document database that I should model accordingly - or is it a very different beast?
Example use case
Here's an example- let's say I have:
a user profile
multiple post types (photo, blog, question)
users can like photos
users can comment on photos, blogs, questions
With a sql database I would have tables:
profiles
photos
blogs
questions
and join tables with referential integrity to support the actions:
photoLikes
blogComments
photoComments
questionComments
With a graph database
I would just have the same core tables
profiles
photos
blogs
questions
and just create graph relationship types for like and comment - relying on the code business logic to enforce the rule that you can't like blogs, etc..
With a document db like MongoDb
Again, I might have the same core tables
profiles
photos
blogs
questions
Comments would be sub-collections under each. There would also be a question of whether to keep the likes as an embedded collection under each profile or under photos - and we would have to increment and sync a like count to the other collection (depending on the use case, we might create a like collection as well). Comments would be tucked under each photo, blog, or question as an embedded collection and not have their own top-level collection.
So my question is this:
How do we model this schema in CosmosDB? Should we model it like a traditional Document Database like MongoDb, or does having access to a graph query allow us additional freedoms like not having to denormalize fields for actions such as "like?"
Azure Cosmos DB database engine is designed to be fully schema-agnostic.
A container (which can be a graph, a collection of documents, or a table) is a schema-agnostic container of arbitrary user-generated content which gets automatically indexed upon ingest. I suggest reading "Schema-Agnostic Indexing with Azure DocumentDB" (http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf), which applies equally to Cosmos DB, to better understand the details.
How do we model this schema in CosmosDB? Should we model it like a traditional Document Database like MongoDb, or does having access to a graph query allow us additional freedoms like not having to denormalize fields for actions such as "like?"
When you start modeling data in Azure Cosmos DB, you need to consider: (1) is your application read-heavy or write-heavy? (2) how is your application going to query and update data? etc. Normally, denormalized data models provide better read performance, while normalizing provides better write performance.
This article explained with example how to model document data for NoSQL databases, and shared some scenarios for using embedded data models, normalized data models and Hybrid data models, which should be helpful.
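For the photo example above, a hybrid model might embed comments in the photo document and denormalize a like count onto it. This is only a sketch; the document shape and names are made up for illustration:

```sql
-- Hypothetical photo document with embedded comments and a
-- denormalized like counter:
-- { "id": "photo1", "type": "photo", "ownerId": "u1",
--   "likeCount": 12,
--   "comments": [ { "userId": "u2", "text": "Nice shot!" } ] }
--
-- One query reads the photo, its like count, and its comment count
-- without any joins across containers:
SELECT p.id, p.likeCount, ARRAY_LENGTH(p.comments) AS commentCount
FROM Photos p
WHERE p.ownerId = "u1"
```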

More information on Columnar or 'Column-family' data model in Cosmos DB

The documentation states that the Cosmos DB engine "natively supports multiple data models: key-value, documents, graphs, and columnar".
However, I can't seem to find any other information relating specifically to the columnar model.
There is also information available on the following APIs:
DocumentDB APIs
Table APIs
Graph APIs
But nothing on Columnar or Column-family, as described in various summaries.
Reference: https://learn.microsoft.com/en-us/azure/cosmos-db/introduction
can't seem to find any other information relating specifically to the columnar model
This article will help you understand the concept of column-family:
You can think of a column-family database as holding tabular data with rows and columns, but the columns are divided into groups known as column families. Each column family holds a set of columns that are logically related together and are typically retrieved or manipulated as a unit. Other data that is accessed separately can be stored in separate column families. Within a column family, new columns can be added dynamically, and rows can be sparse (that is, a row doesn't need to have a value for every column).
Besides, as David Makogon said, you can give your feedback (or comment) on that page, or contact the Cosmos DB team at askdocdb@microsoft.com for more details about the column-family data model.
Cosmos DB now has a Cassandra API, as of November 2017, which provides a column-store interface. This uses the same protocol as native Cassandra, allowing you to use existing SDKs to connect to the Cassandra API.
You'll need to choose the Cassandra API when creating a new Cosmos DB account, which will be one of several APIs you can select from (the others being DocumentDB SQL, MongoDB, Gremlin, and Table).
More information about the Cassandra API is available here.
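For illustration, a column-family style table defined through the Cassandra API might look like the CQL sketch below (the table and column names are hypothetical). The partition key groups logically related columns together, and rows are sparse - a row need not have a value for every column:

```sql
-- Hypothetical CQL table: all events for one user are stored together
-- under the user_id partition, clustered by time; unset columns in a
-- row take no storage.
CREATE TABLE user_events (
    user_id    text,
    event_time timestamp,
    action     text,
    detail     text,
    PRIMARY KEY (user_id, event_time)
);
```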

SQL Server database design

I am planning to create a website using ASP.NET and SQL Server. However, my plan for the database design leaves me wondering if there is a better way.
The website will serve as a repository of information for various users. I figure I would have two databases, a Membership and Profile database.
The profile database would contain user data for all users, where each user may have ~20 tables. I would create the tables when the user account is created and generate a key used to name the tables. The tables are not directly related.
For Example a set of tables for two different users could look like:
User1 Tables - TransactionTable_Key1, AssetTable_Key1, ResearchTable_Key1 ....;
User2 Tables - TransactionTable_Key2, AssetTable_Key2, ResearchTable_Key2 ....;
The Key1, Key2, etc. values would be retrieved based on the MembershipID data when the account was created. This could result in a very large number of tables over time. I'm not sure if setting up the database this way will limit scalability. Any recommendations?
Edit: I should mention that some of these tables would contain 20k+ rows.
Realistically it sounds like you only really need one database for this.
From the way you worded your question, it sounds like you're trying to dynamically create tables for users as they create accounts. I wouldn't recommend this method.
What you want to do is create a master table that contains a primary key for each individual user. I'm assuming this is the Membership table. Then create the ~20 tables that you need for the profiles of these members. Every record, no matter the number of users that you have, will go into these tables. These 20 tables would need to have a foreign key pointing to the unique identifier of the Membership table.
When you want to query a Member for their user information, just select from the tables where the membership table's primary Id matches the foreign key in the profile tables.
This would result in only a few tables in the end and is easily maintainable and follows better database design.
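A minimal T-SQL sketch of that design, using illustrative table and column names (TransactionTable standing in for one of the ~20 profile tables):

```sql
-- One master table holding a primary key per user:
CREATE TABLE Membership (
    MemberId INT IDENTITY PRIMARY KEY,
    UserName NVARCHAR(50) NOT NULL
);

-- One of the profile tables; every user's rows go here,
-- tied back to Membership by a foreign key:
CREATE TABLE TransactionTable (
    TransactionId INT IDENTITY PRIMARY KEY,
    MemberId      INT NOT NULL REFERENCES Membership (MemberId),
    Amount        DECIMAL(18, 2) NOT NULL
);

-- Fetch one member's transactions by matching the keys:
SELECT t.TransactionId, t.Amount
FROM TransactionTable t
JOIN Membership m ON m.MemberId = t.MemberId
WHERE m.UserName = 'user1';
```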
Your ORM layer (EF, LINQ, DAL code) will hate having to deal with one set of tables per tenant. It is much better to have either one set of tables for all tenants in a single database, or a separate database per tenant. The latter is only better if schema upgrades have to be vetted per tenant (like Salesforce.com does). If you can afford to upgrade all tenants to a new schema at once, then there is no reason for a database per tenant.
When you design a schema that holds multiple tenants, the important things to remember are:
don't use heaps; every table must have a clustered index
add the tenant ID as the leftmost key of every clustered index
add the tenant ID as the leftmost key of every non-clustered index too
add the left.TenantID = right.TenantID predicate to every join
add the table.TenantID = @currentTenantID predicate to every query
These are fairly simple rules, and if you obey them (with no exceptions) you will get perfect partitioning per tenant on every query (no query will ever scan rows in a range belonging to a different tenant), so you eliminate contention between tenants. To be more thorough, you can disable lock escalation to make sure no tenant escalates to a lock that blocks every other tenant.
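The rules above can be sketched in T-SQL as follows (table and column names are illustrative):

```sql
-- Tenant ID leads the clustered index (no heaps):
CREATE TABLE Orders (
    TenantID INT   NOT NULL,
    OrderID  INT   NOT NULL,
    Total    MONEY NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (TenantID, OrderID)
);

-- ...and leads every non-clustered index too:
CREATE INDEX IX_Orders_Total ON Orders (TenantID, Total);

-- Every query filters on the current tenant, so it only ever
-- touches that tenant's range of the index:
DECLARE @currentTenantID INT = 42;
SELECT o.OrderID, o.Total
FROM Orders o
WHERE o.TenantID = @currentTenantID;
```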
This design also lends itself to table partitioning and to sharding the database for scale-out.
You definitely don't want to create a set of tables for each user, and you would want all of these in one database. Even with SQL Server 2008's large capacity for tables (note: the limit is really on total objects in the database), it would quickly become unmanageable. Your best bet is to use 20 tables and separate user data via a column. You might consider partitioning the tables by this user value, but that should be tested for performance reasons too.
Yes, since the tables only contain id, key, and value, why not make one single table?
Have the columns:
id, user ID, key, value
Put an Index on the user ID field.
A key idea behind a relational database is that the table structure does not change. You create a solid set of tables, and these are the "bones" of your application.
Cheers,
Daniel
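The single table Daniel describes could be sketched in T-SQL like this (names are illustrative; Key and Value are bracketed because they are reserved-ish words):

```sql
-- One table replaces the per-user tables: each row is one
-- key/value pair belonging to one user.
CREATE TABLE UserValues (
    Id      INT IDENTITY PRIMARY KEY,
    UserId  INT NOT NULL,
    [Key]   NVARCHAR(100) NOT NULL,
    [Value] NVARCHAR(MAX) NULL
);

-- Index on the user ID so one user's rows are fetched cheaply:
CREATE INDEX IX_UserValues_UserId ON UserValues (UserId);
```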
Neal,
The solution really depends on your requirements. If security and data access are a concern and you have only a handful of users, you can set up a different database for each user, with each user's access restricted to his/her own database.
Otherwise, what Daniel Williams suggested is a good alternative, where you have one DB and tables laid out with an indexed column partitioning the users' data rows.
It's hard to tell from the summary, but it looks like you are designing for dynamic attribution by user. This design approach is called EAV (Entity-Attribute-Value) and consists of a simple base collection key (UserID, SiteID, ProductID...) and then rows consisting of name/value pairs. In a more complex version, categories are sometimes added as "super columns" to the tuple/row and provide sub-groupings for a set of name/value pairs.
Designing in this way moves responsibility for data type integrity, relational integrity and tuple integrity to the application layer.
The risk with doing this in a relational system involves the breaking of the tuple or row into a set of rows. Updates, deletes, missing values and the definition of a tuple are no longer easily accessible through human interaction. As your application evolves and the definition of a tuple changes, it becomes almost impossible to tell if a name/value pair is missing because it's part of an earlier-version tuple or because it was unintentionally deleted. Ad-hoc research as well becomes harder to manage as business analysts must keep an understanding of the virtual structure either in their heads or in documentation provided.
If you are looking to implement an EAV model, I would suggest you look at a non-relational (NoSQL) solution like MongoDB or CouchDB. These stores allow a developer to save and retrieve "documents", or JSON-formatted messages, that are essentially made up of a collection of name/value pairs and can look very much like a serialized object. The advantage here is that you can store dynamic attribution without breaking your tuple. You always know that you have a complete tuple, because you can store and retrieve it as a single "blob" of information that can be serialized and deserialized at will. You can also update single attributes within the tuple, if that's a concern.
MongoDB also provides some database-like features such as multiple-attribute indexes, a query engine that is robust in comparison to other similar non-relational offerings and a sharding solution that is much less trouble than trying to do it with MySQL.
I hope this helps.