When to separate columns into new table - asp.net

I have company, customer, supplier etc tables which all have address information related columns.
I am trying to figure out if I should create a new table 'addresses' and separate all address columns to that.
Having address columns on all tables is easy to use and query but I am not sure if it is the right way of doing it from a good design perspective, having these same columns repeat over few tables is making me curious.
Content of the address is not important for me, I will not be checking or using these addresses on any decision making processes, they are purely information related. Currently I am looking at 5 tables that have address information

The answer to all design questions is this:
It depends.
So basically, in the Address case it depends on whether or not you will have more than 1 address per customer. If you will have more than 1, put it in a new Addresses table and give each address a CustomerID. It's overkill (most times, it depends!) to create a generic Address table and map it to the company/customer/supplier tables.
It's also often overkill (and dangerous) to map addresses in a many-to-many relationship between your objects (as addresses can seem to magically change on users if you do this).
The one big rule is: Keep it simple!

This is called Database Normalization. And yes, you want to split them up, if for no other reason because if you need to in the future it will be much harder when you have code and queries in place.
As a rule, you should always design your database in 3rd Normal Form, even for simple apps (there will be a few cases where you won't for performance or logistic reasons, but starting out I would always try to make it 3rd Normal Form, and then learn to cheat after you know the right way of doing it).
EDIT: To expand on this and add some of the comments I have made on other's posts, I am a big believer in starting with a simple design when it comes to code and refactoring when it becomes clear that it is becoming too complex and more indepth object oriented principles would be appropriate. However, refactoring a database that is in production is not so simple. It is all about ROI. It is just too easy to design a normalized database from the outset to justify not doing it. The consequences of a poorly designed database can be catastrophic and it is usually too late before you come to that realization.

Yes, you should separate the addresses to a table of their own. It's a smart thing to know to ask. The key here is that general format of addresses is the same, regardless of who it is; a customer, a company, a supplier... they all have the same fields for addresses.
What makes this worthwhile is the ability to treat addresses as an atomic element; that is, you can generalize all the functionality related to addresses and have it deal with just one table, as opposed to having to worry about it dealing with several tables, and the associated schema drift that can occur.

If you are using those addresses only within the scope of their own tables, there may be no real benefit to moving them to their own tables.
Basically, it doesn't sound like it's worth the effort.

If there's an overlap between tables (i.e. the same organization is entered in both the company and supplier tables), and the address should always be the same in both tables, then it's probably worth moving address off in to its own table and having foreign keys to it from your other three tables. That way, you only have to update it in one spot when it changes.
If the three tables are entirely independent from each other, then there's not really much to gain from moving the data to another table, so you might as well leave it alone.

I think it entirely depends on the purpose of the database. Admittedly all address information is structurally the same and from a theoretical standpoint should all be in a single table linked from the parent table by a key.
However from a performance and query perspective, keeping them in their respective tables does simplify things from a reporting standpoint.
I have a situation with my current company [logistics] where the addresses are actually logically the same - they're all locations regardless of whether they're a pickup location, delivery location, customer etc.
In my case, I'd say that they should most definitely all be in one table. But if it's looking at it from a supplier, customer, contact information standpoint, I'd say that while theoretically it's nice to have the addresses in one table, in practice it won't buy you a whole lot as the data is unlikely to be repeated.

I disagree with Dave. The many-to-many approach (Address <-> User) is both safe, and highly advantageous.
When a customer moves, the addresses in the Address table does NOT change. Instead, the new address is found in the Address table, and the customer etc. is linked to that record. If the new address isn't already in the table, it's added to it.
So do address records themselves ever change? Yes, in cases like these:
it turns out that the address has a typo
US postal service changes the street name
These are the very situations where putting all addresses in one table without repetition pays off; any other arrangement would require an annoying and repetitive data entry.
Of course, if the database is abused, then it would be safer to avoid the many-to-many relationship. But by that token, if the database is in bad hands, it's better to just print everything out, store it in a file cabinet, and verify every transaction against the paper copy. So "protection against misuse" is not a good design principle, in my opinion.

Related

Are the IDs reliable?

I'm starting to work with the Clockify APIs and I'd like to know if the different IDs are reliable or not? As in, is it a really bad idea to keep ther IDs in my database to know what's what or that's something that would work long term for sure? Thank you
IDs in Clockify represent the identities of their respective entities. They don't change, and are unique across the board, so you can use them in your database if you choose so.
That being said, it's always a good practice when dealing with outside data to assign them your own IDs, that way you're not reliant on contracts that you cannot enforce. Provide every entity with id (your own) and externalId or clockifyId and you won't ever be in position when outside change affected your domain logic.

CosmosDB/DocumentDB partitioning with multiple types in same collection

Official recommendation from the team is, to my knowledge, to put all datatypes into single collection that have something like type=someType field on documents to distinguish types.
Now, if we assume large databases with partitioning where different object types can be:
Completely different fields (so no common field for partitioning)
Related (through reference)
How to organize things so that things that should go together end up in same partition?
For example, lets say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment, in same collection, how do we ensure that user, his blogposts and all the corresponding comments end up in same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a Tenant which will group many documents (which can be of different types as you have mentioned) So the Partition Key in this instance should be the TenantId. The partitioning is more about the data relating to a group than the type of data. If the data is related to a User then you could use the UserId, however many users may comment on the same posts so it doesn't seem like a good candidate for a partition key unless there is some de-normalization of the user info so it doest have to relate back to the other users directly.. if that makes sense?

What is the advantage of using 1 to many relationship over adding 1 more column in this particular situation?

This is a typical situation for 1 to many relationships: a chat group iOS app, a group table to record all the group chat related information, like group id, create time, thread title, etc.
To record the participants, of course, I would assume there is another 1:m table. So I was rather surprised to see the app just added another column called "participants" to record it, with each participant is separated by a delimiter (':' to be exact). The problem with that is quite obvious, mixing application code with sql code, e.g. no way to see how many groups a specific user is in with sql code, violated 1NF/2NF, etc.
But they said we understood all your points. But
as this is a mobile app, you always need to use objective c code to access sqlite tables, you won't use sql codes alone. So not a "big deal" to mix them together.
participants don't change often and normally are set when a group is created. If we have 100 participants we would rather just insert 1 record to group table instead of insert 100 records into another group-participants table.
The participant data will be used when someone wants to see who are in this chat group (by several taps on the menu) and when someone joins or leaves the chat group, assume it won't happen often.
So my question is in this particular situation what is the advantage I will gain if I use another 1:m table?
----- update -----
Except for the answer I got, Renzo kindly pointed this discussion to me, which is also very helpful!
It's hard to respond to "is this design better/worse" style questions without understanding the full context. I'm going to make some assumptions based on your question.
You appear to be building a mobile application, supporting "many to many" user chat. I'm picturing something like Slack.
Your application design is using the SQLite database for local storage.
Your local sqlite database on the phone is some kind of subset of the overall application data - like a cache, only showing the data for the current user.
If all that is true, the question really is down to style/maintainability on the one hand, and performance and scalability on the other.
From a "style" point of view, storing the data in a comma-separated value in a column is ugly. A new developer who joins the project, with a background in "regular" database design will consider it at best a hack. On the other hand, iOS developers may consider it perfectly normal.
From a performance point of view, it's probably not worth arguing about - parsing the CSV is probably just as slow as reading/writing from the database.
From a scalability point of view, you may have a problem. If the application design needs to capture in which order users joined the chat, or capture some kind of status (active/asleep, for instance), or provide a bit of history (user x exited at 21:20), you almost certainly end up re-designing the database.

Aggregation or composition or simple association?

There is one example to explaining associations in UML.
A person works for a company; a company has a number offices.
But I am unable to understand the relationship between Person, Company, and Office classes. My understanding is:
a company consists of many persons as employees but these classes exist independently so that is simple association with 0..* multiplicity on Person class' end
a company has many offices and those offices will not exist if there is no company so that is composition having Company as the parent class and 0..* multiplicity on Branch class' end.
But I am not sure of 2nd point. Please correct me if I am wrong.
Thank you.
Why use composition or aggregation in this situation at all? The UML spec leaves the meaning of aggregation to the modeler. What do you want it to mean to your audience? And the meaning of composition is probably too strong for this situation. Thus, why use it here? I recommend you use a simple association.
If I were you, I would stay truer to the problem domain. In the world I know, Offices don't cease to exist when a Company goes out of business. Rather, a Company occupies some number of Offices for some limited period of time. If a Company goes out of business, the Offices get sold or leased to some other Company. The Offices are not burned to the ground.
If you aren't true to the problem domain in an application, then the shortcuts you take will become invalid when the customer "changes the requirements" for that application. The problem domain doesn't actually change much, just the shortcuts you are allowed to take. If you take shortcuts to satisfy requirements in a way that are misaligned with the problem domain, it is expensive to adjust the application. Your customer becomes unhappy and you wind up working overtime. Save yourself and everyone the trouble!
While Jim's answer is correct, I want to add some extra information. There are two main uses for aggregation
Memory management
Database management
In the first case it gives a hint how long objects shall live. This is directly related to memory usage. If the target language is one which (like most modern languages) uses a garbage collector, you can simply ignore this model information.
In the second case, it's only partially a memory question. A composite aggregation in a database indicates that the aggregated elements need to be deleted along with the aggregating element. This is less a memory but in most cases a security issue. So here you have to think twice.
A shared aggregation however has a very esoteric meaning in all cases.

How to implement gapless, user-friendly IDs in NHibernate?

I'm designing an application where my Order objects need to have a sequential and user-friendly Id field. I'm avoiding the HiLo algorithm because of the rather large gaps it produces (see here). Naturally, Guid values would make my corporate users go bananas. I'm also avoiding Oracle sequences because of the major disadvantages of it:
(From: NHibernate POID Generators revealed)
Post insert generators, as the name
suggest, assigns the id’s after the
entity is stored in the database. A
select statement is executed against
database. They have many drawbacks,
and in my opinion they must be used
only on brownfield projects. Those
generators are what WE DO NOT SUGGEST
as NH Team.
> Some of the drawbacks are the
following:
Unit Of Work is broken with the use of
those strategies. It doesn’t matter if
you’re using FlushMode.Commit, each
Save results in an insert statement
against DB. As a best practice, we
should defer insertions to the commit,
but using a post insert generator
makes it commit on save (which is what
UoW doesn’t do).
Those strategies
nullify batcher, you can’t take the
advantage of sending multiple queries
at once(as it must go to database at
the time of Save).
Any ideas/experience on implementing user-friendly IDs without major gaps between them?
Edit:
User friendly Id fields are ones my corporate users can memorize and even discuss and/or have phone conversations talking about a particular Order by its code, e.g. "I'm calling to know why the order #1625 was denied.".
The Id doesn't need to be strictly gapless, but I am worried that my users would get confused when they see gaps like 100, 201, 305. For my older projects, I currently implement NHibernate using Oracle sequences which occasionally lose a few sequences when exceptions are thrown, but yet keep a rather tidy order to them. The downside to them is how they break the Unit of Work which results in additional hits to the database for every Save command with or without the Session.Flush.
One option would be to keep a key-table that simply stores an incrementing value. This can introduce a few problems, namely possible locking issues as well as additional hits to the database.
Another option might be to refine what you mean by "User-friendly Id". This could consist of a combination of a Date/Time and a customer-specific sequence (or including the customer id as well). Also, your order id does not necessarily have to be the actual key on the table. There is nothing to say that you can't use a surrogate key with a separate "calculated" column which represents the order id.
The bottom-line is that it sounds like you want to use a surrogate key, but have the benefits of a natural key. It can be very difficult to have it both ways and a lot comes down to how you actually plan on using the data, how users interpret the data, and personal preference.

Resources