Schema-less AND relational database - self-join

I have thousands of documents of different types and therefore different fields. My task is to find pairs or a group of documents, with specific relations:
A.race='dwarf' and B.race='elfe' and C.profession='thief'
A.haircolor = C.haircolor and B.favorite_meal = C.favorite_meal
Is there a database which is schema-less and relational?

Having a schema is part of the definition of a relational database, so a database can not be both schemaless and relational.
But when you are searching for a database which is good for modeling and analyzing relations between entities without enforcing a consistent schema, you could take a look at graph databases like Neo4j. These databases focus on defining entities mostly by their relation to other entities. They make it really easy to find entities which have common relations to other entities.

Related

DynamoDB optimized search for common parent

So Im designing currently three tables, an organization, organization_relationships, members.
Organization
OrgID PK
Metdata..
Org_Relationships
ParentOrgID PK
ChildOrgID Range/GSI
Member
OrgID PK
MemberID Range/GSI
One way that I need to access data, is by determining whether two members share a parent organization. With the way this is right now, I would basically have to do a weird search on the tables, that requires multiple calls to the table to determine whether two members belong to the same parent organization. With that being said is there a more efficient way of designing the table to do this without requiring multiple calls to the table.
The reason you're having to perform multiple queries is because you've modeled the relationship across several tables. This is a common approach when using traditional relational databases, but could be considered an anti-pattern with NoSQL databases.
Keep in mind that DynamoDB does not have a join operation like SQL databases. Therefore, it is a best practice to store related data in the same DynamoDB table. This can be counter-intuitive if you're used to working with relational DBs.
There are several ways to model your data in DynamoDB. The approach you choose depends on your access patterns. In other words, you store your data in a way that makes it easier to get the data your application needs.
For example, here's one way to model Users and Organizations:
The primary key is made up of a user id (e.g. USER#) and a sort key of META. This record (called an "item") in DynamoDB is where I'll define various user attributes. In this example, I've provided a name and an org attribute.
For illustrative purposes, I've also created a global secondary index (GSI) that swaps the partition key/sort key pattern in your base table. Your GSI will look like this:
This lets you fetch all users by organization.
If I wanted to check if two users are in the same organization, I can either query the GSI, or fetch both user records and compare the org fields.
This is just an example meant to give you a starting point with NoSQL design. The key takeaways here are:
NoSQL (or non-relational) data modeling is different than SQL (relational) data modeling.
You want to store related data in the same table.
How you store your data depends entirely on how you plan to use the data.

How to query from two containers in Cosmos DB (SQL API)

I am new to cosmos db. I chose cosmos db (core sql), created a database having two containers say EmployeeContainer and DepartmentContainer. Now I want to query these two container and want to fetch employee details with associated department details. I stuck on a point and need help.
Below is the structure of my containers.
EmployeeContainer : ID, Name, DepartmentID
DepartmentContainer: ID, Name
Thanks in advance.
Cosmos DB is not a relational database. You do not store different entities in different containers if they are queried together. They are either embedded in other entities or stored as separate rows using a shared partition key with other entities in the same container.
Before you get too far with Cosmos you need to understand how to model and partition data to ensure the best possible performance. I strongly recommend you read the docs on partitioning and specifically read these docs below.
Data modeling in Cosmos DB
Partitioning in Cosmos DB
How to model and partition data - a real world example
And watch Data Modeling in Cosmos DB - What every relational developer should know
It completely depends on the type of data you are trying to model. Generally, it comes down to relationships. 1:1 or 1:few often are best for embedding related items or where queries are updated together. 1:many or many:many for referencing related items are queried or updated independently.
For great talks on these issues check out https://www.gotcosmos.com/conf/ondemand
You can use subquery.
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-subquery#mimic-join-with-external-reference-data
But this may consumes a lot of RU. And only inner join for now.

How to relate collections in Cloud Firestore?

Using SQLite, I had a table of products, one category and one subcategory.
The subcategory had FK of the category, and the product had FK of subcategory.
But since Cloud Firestore is a non-relational database, Id like to understand how it works
the relationship of collections or documents.
Any relationships between documents and collections is purely one that you express in code. Unlike many SQL databases, you can't force referential integrity between items. You can create reference type fields that point to other documents, but those documents don't have to exist. You can also simply store document IDs as string within other related documents. It's entirely flexible and up to you to decide what's best for your specific case.

Should CosmosDB be modeled like a document database or a graph database?

I see that a CosmosDb can support both graph queries as well as more traditional SQL like queries - however I'm a bit confused about what kind of underlying schema is best at the collections level. If I were to model something in MongoDb or SQL Server, or Neo4j, I would have very different schemas. Also - it seems like I can query using more traditional SQL-like syntax - which makes it confusing about what's right or efficient underneath. Sometimes, making something easy to query does not mean that one should assume that it's an efficient query.
Is CosmosDb at it's heart a document database and I should model it accordingly - or is it a very different beast.
Example use case
Here's an example- let's say I have:
a user profile
multiple post types (photo, blog, question)
users can like photos
users can comment on photos, blogs, questions
With a sql database I would have tables:
profiles
photos
blogs
questions
and join tables with referential integrity to support the actions:
photoLikes
blogComments
photoComments
questionComments
With a graph database
I would just have the same core tables
profiles
photos
blogs
questions
and just create graph relationship types for like and comment - relying on the code business logic to enforce the rule that you can't like blogs, etc..
With a document db like MongoDb
Again, I might have the same core tables
profiles
photos
blogs
questions
Comments would be sub collections under each - and there would be a question of whether we want to keep the likes as an embedded collection under each profile, or under photos.. and we would have to increment and sync a like count to the other collection (depending on the use case we might create a like collection as well). Comments would be tucked under each photo, blog or question as an embedded collection and not have their own top-level collection.
So my question is this:
How do we model this schema in CosmosDB? Should we model it like a traditional Document Database like MongoDb, or does having access to a graph query allow us additional freedoms like not having to denormalize fields for actions such as "like?"
Azure Cosmos DB database engine is designed to be fully schema-agnostic.
A container (which can be a graph, a collection of documents, or a table) is a schema-agnostic container of arbitrary user generated content which gets automatically indexed upon ingest. I suggest to read "Schema-Agnostic Indexing with Azure DocumentDB" - http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf, which is the same in Cosmos DB to better understand the details.
How do we model this schema in CosmosDB? Should we model it like a traditional Document Database like MongoDb, or does having access to a graph query allow us additional freedoms like not having to denormalize fields for actions such as "like?"
When you start modeling data in Azure Cosmos DB, you need to consider: 1.Is your application read heavy or write heavy? 2.How is your application going to query and update data? etc. Normally denormalized data models can provide better read performance, normalizing can provide better write performance.
This article explained with example how to model document data for NoSQL databases, and shared some scenarios for using embedded data models, normalized data models and Hybrid data models, which should be helpful.

sql server database design

I am planning to create a website using ASP.NET and SQL Server. However, my plan for the database design leaves me wondering if there is a better way.
The website will serve as a repository of information for various users. I figure I would have two databases, a Membership and Profile database.
The profile database would contain user data for all users, where each user may have ~20 tables. I would create the tables when the user account is created and generate a key used to name the tables. The tables are not directly related.
For Example a set of tables for two different users could look like:
User1 Tables - TransactionTable_Key1, AssetTable_Key1, ResearchTable_Key1 ....;
User2 Tables - TransactionTable_Key2, AssetTable_Key2, ResearchTable_Key2 ....;
The Key1, Key2 etc.. values would be retrieved based on the MembershipID data when the account was created. This could result in a very large number of tables over time. I'm not sure if this will limit scalability by setting up the database in this way. Any recommendations?
Edit: I should mention that some of these tables would contain 20k+ rows.
Realistically it sounds like you only really need one database for this.
From the way you worded your question, it sounds like you're trying to dynamically create tables for users as they create accounts. I wouldn't recommend this method.
What you want to do is create a master table that contains a primary key for each individual user. I'm assuming this is the Membership table. Then create the ~20 tables that you need for the profiles of these members. Every record, no matter the number of users that you have, will go into these tables. These 20 tables would need to have a foreign key pointing to the unique identifier of the Membership table.
When you want to query a Member for their user information, just select from the tables where the membership table's primary Id matches the foreign key in the profile tables.
This would result in only a few tables in the end and is easily maintainable and follows better database design.
Your ORM layer (EF, LINQ, DAL code) will hate having to deal with one set of tables per tenant. It is much better to have either one set of tables for all tenant in a single database, or a separate database per tenant. The later is only better if schema upgrade has to be vetted by tenant (like Salesforce.com has). If you can afford to upgrade all tenant to a new schema at once then there is no reason for database per tenant.
When you design a schema that hold multiple tenant the important things to remember are
don't use heaps, all tables must be clustered index
add the tenant ID as the leftmost key to every clustered
add the tenant ID as the leftmost key to every non-clustered index too
add the Left.tenantID = right.tenantID predicate to every join
add the table.TenantID = #currentTenantID to every query
These are fairly simple rules and if you obey them (with no exceptions) you will get a perfect partitioning per tenant of every query (no query will ever ever scan rows in a range of a different tenant) so you eliminate contention between tenants. To be more through, you can disable lock escalation to make sure no tenant escalates to block every other tenant.
This design also lends itself to table partitioning and to sharing the database for scale-out.
You definitely don't want to create a set of tables for each user, and you would want these only in one database. Even with SQL Server 2008's large capacity for tables (note really total objects in database), it would quickly become unmanageable. Your best bet is to use 20 tables, and separate them via a column into user areas. You might consider partitioning the tables by this user value, but that should be tested for performance reasons too.
Yes, since the tables only contain id, key, and value, why not make one single table?
Have the columns:
id, user ID, key, value
Put an Index on the user ID field.
A key idea behind a relational database is that the table structure does not change. You create a solid set of tables, and these are the "bones" of your application.
Cheers,
Daniel
Neal,
The solution really depends on your requirement. If security and data access are concern and you have only a handful of users, you can set up a different db for each user with access for him set to only his/her database.
Other wise, what Daniel Williams suggested is a good alternative where you have one DB and tables laid out with a indexed column partitioning the users data rows.
It's hard to tell from the summary, but it looks like you are designing for dynamic attribution by user. This design approach is called EAV (Entity-Attribute-Value) and consists of a simple base collection key (UserID, SiteID, ProductID...) and then rows consisting of name/value pairs. In a more complex version, categories are sometimes added as "super columns" to the tuple/row and provide sub-groupings for a set of name/value pairs.
Designing in this way moves responsibility for data type integrity, relational integrity and tuple integrity to the application layer.
The risk with doing this in a relational system involves the breaking of the tuple or row into a set of rows. Updates, deletes, missing values and the definition of a tuple are no longer easily accessible through human interaction. As your application evolves and the definition of a tuple changes, it becomes almost impossible to tell if a name/value pair is missing because it's part of an earlier-version tuple or because it was unintentionally deleted. Ad-hoc research as well becomes harder to manage as business analysts must keep an understanding of the virtual structure either in their heads or in documentation provided.
If you are looking to implement an EAV model, I would suggest you look at a non-relational solution (nosql) like MongoDB or CouchDB. These stores allow a developer to save and retrieve "documents" or json-formatted messages that are essentially made up of a collection of name/value pairs and can look very much like a serialized object. The advantage here is that you can store dynamic attribution without breaking your tuple. You always know that you have a complete tuple because you can store and retrieve it as a single "blob" of information that can be serialized and deserialized at-will. You can also update single attributes within the tuple, if that's a concern.
MongoDB also provides some database-like features such as multiple-attribute indexes, a query engine that is robust in comparison to other similar non-relational offerings and a sharding solution that is much less trouble than trying to do it with MySQL.
I hope this helps.

Resources