Aggregation or composition or simple association?

There is an example explaining associations in UML.
A person works for a company; a company has a number of offices.
But I am unable to understand the relationship between Person, Company, and Office classes. My understanding is:
A company consists of many persons as employees, but both classes exist independently, so that is a simple association with 0..* multiplicity on the Person end.
A company has many offices, and those offices will not exist if there is no company, so that is composition, with Company as the parent class and 0..* multiplicity on the Office end.
But I am not sure of the 2nd point. Please correct me if I am wrong.
Thank you.

Why use composition or aggregation in this situation at all? The UML spec leaves the meaning of aggregation to the modeler. What do you want it to mean to your audience? And the meaning of composition is probably too strong for this situation. Thus, why use it here? I recommend you use a simple association.
If I were you, I would stay truer to the problem domain. In the world I know, Offices don't cease to exist when a Company goes out of business. Rather, a Company occupies some number of Offices for some limited period of time. If a Company goes out of business, the Offices get sold or leased to some other Company. The Offices are not burned to the ground.
If you aren't true to the problem domain in an application, then the shortcuts you take will become invalid when the customer "changes the requirements" for that application. The problem domain doesn't actually change much, just the shortcuts you are allowed to take. If you take shortcuts to satisfy requirements in a way that is misaligned with the problem domain, it is expensive to adjust the application. Your customer becomes unhappy and you wind up working overtime. Save yourself and everyone the trouble!
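To make that concrete, here is a minimal sketch of modeling the occupancy as its own association rather than as composition (hypothetical Python classes; none of these names come from the original question):

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class Office:            # exists independently of any company
        address: str

    @dataclass
    class Company:           # exists independently of any office
        name: str

    @dataclass
    class Occupancy:         # the time-limited link is its own object
        company: Company
        office: Office
        starts: date
        ends: Optional[date] = None   # None means still occupied

    hq = Office("1 Main St")
    acme = Company("Acme")
    lease = Occupancy(acme, hq, date(2020, 1, 1))
    lease.ends = date(2023, 6, 30)    # Acme folds; the Office object remains
    Occupancy(Company("Globex"), hq, date(2023, 7, 1))   # new occupant

The offices are never owned by the company here; only the occupancy records come and go, which is exactly the simple-association reading.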

While Jim's answer is correct, I want to add some extra information. There are two main uses for aggregation:
- Memory management
- Database management
In the first case it gives a hint about how long objects should live, which is directly related to memory usage. If the target language is one that (like most modern languages) uses a garbage collector, you can simply ignore this model information.
In the second case, it's only partially a memory question. A composite aggregation in a database indicates that the aggregated elements need to be deleted along with the aggregating element. This is less a memory issue and more, in most cases, a security issue. So here you have to think twice.
A shared aggregation, however, has a very esoteric meaning in all cases.
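To make the database point concrete: composite aggregation typically maps to a cascading delete, which is exactly where the "think twice" applies, since the cascade silently destroys the aggregated rows. A minimal sketch with made-up table names, using SQLite:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs per connection
    db.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("""CREATE TABLE office (
        id INTEGER PRIMARY KEY,
        company_id INTEGER REFERENCES company(id) ON DELETE CASCADE,
        address TEXT)""")
    db.execute("INSERT INTO company VALUES (1, 'Acme')")
    db.execute("INSERT INTO office VALUES (1, 1, '1 Main St')")

    db.execute("DELETE FROM company WHERE id = 1")  # composite: offices go too
    print(db.execute("SELECT COUNT(*) FROM office").fetchone())  # prints (0,)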


Doing complex reports with microservices

I'm starting a new project and am interested in architecting it as microservices. I'm trying to wrap my head around it:
Say that I have an order service and a product service. Now I want to make a report service that gives me all orders that contain a product from a certain product category.
Since orders don't know about products, that means I would need to fetch all orders, loop over them, fetch the products for each order, and then return the ones that match.
Is this assumption correct, or is there a more efficient way of doing this with microservices?
In a microservices architecture, the procedure is to distill the use cases and the service boundaries of the application. In the question above, there are at least two service boundaries: one for transactions and another for reporting.
When you have two different service boundaries, the typical approach is to duplicate some data elements between them; e.g., whenever you make a sale, the data should be sent to both the reporting and transactional services. One possible approach to broadcasting the data to the different boundaries is to use a message queue. Duplicating the data allows the services to evolve and operate independently and become self-sufficient, which is one of the goals of microservices.
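Here is a toy, in-process sketch of that duplicate-on-write idea (the hand-rolled publish/subscribe below stands in for a real message queue such as RabbitMQ or Kafka, and every name is invented):

    from collections import defaultdict

    subscribers = defaultdict(list)

    def subscribe(topic, handler):
        subscribers[topic].append(handler)

    def publish(topic, event):
        for handler in subscribers[topic]:
            handler(event)

    # The transactional service keeps the order of record...
    orders = {}
    subscribe("order.placed", lambda e: orders.update({e["order_id"]: e}))

    # ...while the reporting service keeps its own copy, denormalized with the
    # product category so the report never needs a cross-service join.
    orders_by_category = defaultdict(list)
    subscribe("order.placed",
              lambda e: orders_by_category[e["category"]].append(e["order_id"]))

    publish("order.placed",
            {"order_id": 42, "product_id": 7, "category": "books"})
    print(orders_by_category["books"])   # [42]

The report query then runs entirely inside the reporting service's own store, instead of fetching and looping over every order.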
A personal word of advice though, you might want to start with a monolith before going the microservices route. Microservices are generally more operationally heavy; it will be difficult to reason about its advantages during the initial application stages. It tends to work better after having developed the monolithic application since it would be easier to see what didn't work and what could be improved by a microservices-like system.

Why does HyperLogLog work, and for which real-world problems?

I know how HyperLogLog works, but I want to understand in which real-world situations it actually applies, i.e. where it makes sense to use HyperLogLog and why. If you've used it to solve any real-world problems, please share. What I am looking for is: given HyperLogLog's standard error, in which real-world applications is it really used today, and why does it work?
("Applications for cardinality estimation", perhaps too broad? I would like to add this simply as a comment, but it won't fit.)
I would suggest you turn to the numerous academic studies of the subject; academic papers usually contain some information on "prior research on the subject" as well as "applications for which the subject has been used". You could start by traversing the references of interest cited in the following article:
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, by P. Flajolet et al.
... This problem has received a great deal of attention over the past two decades, finding an ever growing number of applications in networking and traffic monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service), and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various time slices is an important indication for detecting attacks and monitoring traffic, as it records the number of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of different connections, and though they may well pass unnoticed amongst a huge traffic, their activity becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts—natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators.
At my work, HyperLogLog is used to estimate the number of unique users or unique devices hitting different code paths in online services. For example, how many users are affected by each type of service error? How many users use each feature? There are MANY interesting questions HyperLogLog allows us to answer.
Stack Overflow might use HyperLogLog to count the views of each question. Stack Overflow wants to make sure that one user can only contribute one view per item, so every view is unique.
It could be implemented with a set; every question would have a set that stores the usernames:
question#ID121e={username1,username2...}
Creating a set for each question would take up some space, and consider how many questions have been asked on this platform: the total amount of space needed to keep track of every unique viewer would be huge. But HyperLogLog uses about 12 kB of memory per key no matter how many usernames are added, even with 10 million views.
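If you want to see why the memory stays flat, here is a toy, from-scratch sketch of the algorithm in Python (my own names throughout; it omits HyperLogLog's small-range correction, so treat it as an illustration, not a production implementation):

    import hashlib

    class HLL:
        """Toy HyperLogLog: 2**p registers, each holding the largest
        'leading zeros + 1' rank seen among the hashes routed to it."""

        def __init__(self, p=14):
            self.p = p
            self.m = 1 << p                  # p=14 -> 16384 registers
            self.reg = [0] * self.m          # packed at 6 bits each: ~12 kB
            self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction

        def add(self, item):
            h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
            j = h >> (64 - self.p)                     # top p bits pick a register
            w = h & ((1 << (64 - self.p)) - 1)         # remaining bits
            rank = (64 - self.p) - w.bit_length() + 1  # leftmost 1-bit position
            self.reg[j] = max(self.reg[j], rank)

        def count(self):
            return self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)

    hll = HLL()
    for i in range(1_000_000):           # a million distinct "usernames"
        hll.add(f"user{i}")
    print(round(hll.count()))            # ~1,000,000; typical error ~1.04/sqrt(m)

No matter how many distinct items you add, the sketch never grows beyond its fixed bank of registers, which is the whole point.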

StatsD/Graphite Naming Conventions for Metrics

I'm beginning the process of instrumenting a web application, and using StatsD to gather as many relevant metrics as possible. For instance, here are a few examples of the high-level metric names I'm currently using:
http.responseTime
http.status.4xx
http.status.5xx
view.renderTime
oauth.begin.facebook
oauth.complete.facebook
oauth.time.facebook
users.active
...and there are many, many more. What I'm grappling with right now is establishing a consistent hierarchy and set of naming conventions for the various metrics, so that the current ones make sense and that there are logical buckets within which to add future metrics.
My question is twofold:
1. What relevant metrics are you gathering that you have found indispensable?
2. What naming structure are you using to categorize metrics?
This is a question that has no definitive answer but here's how we do it at Datadog (we are a hosted monitoring service so we tend to obsess over these things).
1. Which metrics are indispensable? It depends on the beholder. But at a high level, for each team: any metric that is as close to their goals as possible (which may not be the easiest to gather).
System metrics (e.g. system load, memory, etc.) are trivial to gather but seldom actionable, because it is too hard to reliably connect them to a probable cause.
On the other hand, the number of completed product tours matters to anyone tasked with making sure new users are happy from the first minute they use the product. StatsD makes this kind of stuff trivially easy to collect.
We have also found that the core set of key metrics for any team changes as the product evolves, so there is a continuous editorial process.
Which in turn means that anyone in the company needs to be able to pick and choose which metrics matter to them. No permissions asked, no friction to get to the data.
2. Naming structure. The highest level of the hierarchy is the product line or the process. Our web frontend is internally called dogweb, so all the metrics from that component are prefixed with dogweb.. The next level of the hierarchy is the sub-component, e.g. dogweb.db., dogweb.http., etc.
The last level of hierarchy is the thing being measured (e.g. renderTime or responseTime).
The unresolved issue in Graphite is the encoding of metric metadata in the metric name (and selection using *, e.g. dogweb.http.browser.*.renderTime). It's clever but can get in the way.
We ended up implementing explicit metadata in our data model, but this is not in statsd/graphite so I will leave the details out. If you want to know more, contact me directly.
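For illustration, here is roughly what that hierarchy looks like when emitting metrics, assuming the common jsocol statsd Python client (a sketch, not Datadog's actual code):

    import time
    import statsd   # pip install statsd

    # Hierarchy from above: <product/process>.<sub-component>.<measurement>
    stats = statsd.StatsClient("localhost", 8125, prefix="dogweb")

    stats.incr("http.status.4xx")            # counter: dogweb.http.status.4xx
    stats.timing("http.responseTime", 212)   # timer (ms): dogweb.http.responseTime

    with stats.timer("db.queryTime"):        # times the block and reports it
        time.sleep(0.05)                     # stand-in for a real query

Because StatsD rides on UDP, this runs (silently) even when no server is listening, which is what makes it so cheap to instrument everywhere.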

One massive instance of an app, or many medium-sized ones?

A web application we wrote, intended for one customer, is going to be productized and sold to dozens of companies, and we will be doing the hosting.
I could use some guidance about the pros and cons of rolling out a separate instance for each customer versus going with a single (or very small number of) multi-tenant instances.
At first, as we ramp up, I will have to roll out a separate instance of the application for each new customer (they will come online one at a time) because it's the only immediate option. I imagine this won't scale very well as far as maintenance goes: rolling out changes will become very tedious and possibly error-prone once there are more than 4 or 5 instances out there, unless we automate that somehow.
Also, the instance-per-customer philosophy seems like it might lead to a bunch of forks if people need customizations. And it would be nice to avoid that.
So what has your experience been with this?
Bonus question #1: What's the performance difference between 10 SQL Servers with 2m records each versus one huge one with 20m? Let's say they are all in one table and we're mainly doing inserts and selects on single records. Sometimes the selects are on an indexed varchar(12) or date field.
Bonus Question #2: I imagine that to avoid forking, we would have to make the customizations configurable, or build a plug-in architecture. However, that might increase the cost of doing customizations, and I don't want to be one of those shops that takes a week to resize a textbox, and I don't want to over-invest in infrastructure. Any thoughts on that?
Scale Details
Each customer will have a decent amount of data -- up to a few million records.
There will be a very small number of concurrent users, only a few per customer, plus a handful of internal reps on our end.
It's unclear whether each customer will require customizations, but I would say some of them probably will, and maybe some of those changes will be things that other customers will not want to see.
When faced with a similar challenge, here's what we did:
1. We have one code base with multiple SQL servers, and we maintain multiple IIS servers with copies of that code base. We are free to move clients around from SQL server to SQL server to maximize performance.
2. If a customer has the $ for it, we will install them on their own server and maintain a separate IIS server for them. This accommodates the largest customers, who pay much more money every month (tenfold more). We do not, however, give them a separate code base. If they need a mod, we make it visible on a per-client basis (see #3).
3. Custom programming usually results in a configurable option. Even the people who pay us to have their own server get the same version of the code. Sometimes it's as simple as a clause in the code that says "if customer = ourbigcustomer then turn on this option". Yes, that's kludgy hard-coding, but if the customer has enough money, that is fine with me.
4. I didn't quite get from your question whether you wanted to mix different customers' data into one big database. Our rule is we never do that (never ever). It is one of the wisest choices we ever made: it makes data manipulation much less risky and restores easier.
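A hypothetical sketch of how points 1-3 can look in code: a per-client routing table plus per-client feature flags in place of the hard-coded customer check (all names invented):

    CLIENTS = {
        "acme":   {"db": "Server=sql01;Database=acme", "features": {"custom_report"}},
        "globex": {"db": "Server=sql02;Database=globex", "features": set()},
    }

    def connection_string(client):
        # moving a client to another SQL server means editing one entry,
        # not the application code
        return CLIENTS[client]["db"]

    def feature_enabled(client, feature):
        return feature in CLIENTS[client]["features"]

    if feature_enabled("acme", "custom_report"):
        print("render the Acme-specific report")   # replaces if customer == "..."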
I don't see a good reason for either of your two options. I think the real answer lies somewhere in the middle: having multiple instances, each hosting multiple clients.
This adds another layer of automation processing, but it means you can keep the hosting cheap (you won't need to go out and buy a Cray any time soon) and (hopefully) this sort of mentality means you could do failover backups fairly easily.
But let's not get ahead of ourselves... We're talking about a webapp, right? Get your database(s) and ASP.NET on different machines. Cluster your databases and you'll have a much happier time playing around with various front-end scenarios. You'll also be able to upscale whichever area runs out of puff first.
By the sounds of it, you'll end up with one clustered database over half a dozen, if not a full dozen, database machines, and only a couple of front-end boxes.
As for customisations, you've nailed it. You either provide a completely database-hosted set of editable templates, or you have to customise whole instances. I'm all for the first. It's a lot of work (without much in return at first), but it's well worth it, as you should only need to change the core code when you do upgrades (and you will!). Hunting through a hundred customers' custom instances to make sure they upgrade safely will kill a developer! Templates are the answer. At the very least, you could allow custom CSS without much pain (but they'd need somebody who knew their stuff).
Edit: I've seen a couple of posts going for the all-in-one method. Splitting the instances over multiple machines insulates you from a couple of things:
If you introduce a bug not caught in testing, only a few clients are affected at once
Hardware fails. Having one mega-server fall over will annoy a lot of people at once. Having a failover mega-server is massively expensive. Having a spare failover box per three or four running servers is much cheaper and annoys fewer people.
Performance can be balanced between boxes on a client-by-client basis, so you can put a few light-use clients with a heavy client, or just fill a box with a few medium-use clients, etc.
On the same idea, usage spikes or other slowdowns only affect clients on the same box. Of course this doesn't mean the same for the database, but you can split that up into a cluster of clusters when you get there.
The big advantage of individual instances is scaling out as each customer's demand increases. For example, if you're running on a single server and one customer suddenly needs more performance, you're stuffed. But if they're all individual, then moving that customer to a shiny new server is relatively easy.
The big disadvantage will be in managing the instances all individually (regardless of whether they're all running on the same server or not).
Regardless, you should only ever have one instance of the codebase, and customisation should all be controlled through plugins and configuration. The front end should naturally be separate from content. Although the cost of making a change may be higher, the benefit in terms of features you can offer your other customers (which will just be customisations you've been asked to do) will pay off, I'm sure. And that says nothing of how much easier it'll be to manage a single codebase, as opposed to several.
I would strongly advise going with the single instance hosted by your company. This has the following advantages:
- You have physical access to all code and databases to make changes and updates.
- You control the quality of the hardware it is running on.
- When you fix a bug in common code, you have fixed it once for all customers.
- You can refactor the application design to better support customer-specific code and avoid forking.
- As the number of customers grows, you can scale up and scale out your servers to meet performance/responsiveness requirements.
- Your application code and databases cannot be tampered with by "inquisitive" customers.
I would have to say it is almost more important where your application is running as opposed to how many separate instances there are of it.
Sure, maintaining multiple separate instances is not ideal due to the support/maintenance overhead, but if these apps are all on servers you control, life is much easier than needing remote/physical access to different customers' networks and servers.
Joel Spolsky also talks about exactly this on StackOverflow podcast 67.
One thing Joel has learned from selling FogBugz: software designed to be installed on a server in-house at a customer's site, under full control of that customer, is almost never worth the hassle.
20 million records is, relatively speaking, not a huge SQL Server database. A single well-provisioned SQL Server could handle this size comfortably. More important, however, is the number of concurrent accesses to the database. You say that there will be only a few users per customer, though, so this is unlikely to hit you until the level of concurrency grows.
All of the above are good points, but you are missing two key questions: what price point is the service offered at, and how many customers (order of magnitude) will you ultimately have to support (i.e. market size)? In 3 years, will you have a maximum of 10 customers, each paying you $500,000 per year, or 500 customers, each paying you $10,000 per year? For a small set of high-paying premium customers, the advantage of individual deployments is clear, whereas lower prices and a larger customer base mean that a shared solution (à la Oli's comment) is the best way to go. Or go with a cloud platform, although I've only read the hype and tinkered rather than deployed that in the field.
Bonus Question 1: table layout, indexing, the number of reads/writes, and the efficiency and complexity of stored procedures (you are using procs or at least prepared statements, right?) all matter a heck of a lot more than the number of physical records in the database, up to a point. Beyond that point you will likely find yourself needing to provide individual SQL Server instances for each customer or for a pool of customers, once again depending on some of the questions raised above.
Bonus Question 2: Putting the time into your design for templating and a plugin architecture is essential in this situation, and you need to do it sooner rather than later. Once you're in the grind of customizing code for paying customers, you will likely not have the time to do it right. This point cannot be stressed enough. Templates and admin tools that give you quick and deep access to data-driven changes in your product will save you a lot of time down the road. As your company/group expands, you can then add less technical staff who can be "product experts" and perform 90% of customizations and maintenance, freeing up your core developers to continue development or move on to other projects. Finally, don't neglect your data tier in this planning process: having a core data tier of (almost) immutable stored procs and tables is very important, with custom tables and stored procs clearly demarcated using a good naming convention.
Good luck, feel free to provide more details if you'd like more specific suggestions.
Based on some of the advice received here, we did end up implementing a monolithic multi-tenant version of our application.
I'm glad we did. By the time it was done, we had 3 or 4 forks of the code base (mainly custom skins and things we didn't have n-level support for, but also some actual features), and it was only getting crazier.
We got the multi-tenant version up and successfully folded everything in. There ended up being a lot to think about and a lot to keep track of, but our customers never even knew they had been moved to a new system.
I will say that the actual customer migration was a bit of a bear. I thought at first that we would be able to do it by hand in the backend, but I ended up having to write some fairly involved scripts to get the job done. There were just too many identity columns, and it's not like you can just turn off constraints temporarily when you're importing into a live production system.

When to separate columns into new table

I have company, customer, supplier, etc. tables which all have address-related columns.
I am trying to figure out if I should create a new 'addresses' table and move all the address columns into it.
Having address columns in all the tables is easy to use and query, but I am not sure it is right from a good-design perspective; having these same columns repeated over a few tables makes me curious.
The content of the addresses is not important to me; I will not be checking or using them in any decision-making processes. They are purely informational. Currently I am looking at 5 tables that have address information.
The answer to all design questions is this:
It depends.
So basically, in the address case it depends on whether or not you will have more than one address per customer. If you will have more than one, put the addresses in a new Addresses table and give each address a CustomerID. It's overkill (most times; it depends!) to create a generic Address table and map it to the company/customer/supplier tables.
It's also often overkill (and dangerous) to map addresses in a many-to-many relationship between your objects (as addresses can seem to magically change on users if you do this).
The one big rule is: Keep it simple!
This is called database normalization. And yes, you want to split them up, if for no other reason than that doing it later, once you have code and queries in place, will be much harder.
As a rule, you should always design your database in 3rd Normal Form, even for simple apps (there will be a few cases where you won't, for performance or logistics reasons, but starting out I would always try to make it 3rd Normal Form, and then learn to cheat after you know the right way of doing it).
EDIT: To expand on this and add some of the comments I have made on others' posts: I am a big believer in starting with a simple design when it comes to code, and refactoring when it becomes clear that it is getting too complex and more in-depth object-oriented principles would be appropriate. However, refactoring a database that is in production is not so simple. It is all about ROI. It is just too easy to design a normalized database from the outset to justify not doing it. The consequences of a poorly designed database can be catastrophic, and it is usually too late by the time you come to that realization.
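As a sketch of what the split can look like (made-up column names, SQLite for brevity, and only one of several reasonable shapes):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE addresses (
        id     INTEGER PRIMARY KEY,
        street TEXT, city TEXT, postal_code TEXT, country TEXT
    );
    CREATE TABLE customers (
        id         INTEGER PRIMARY KEY,
        name       TEXT,
        address_id INTEGER REFERENCES addresses(id)
    );
    CREATE TABLE suppliers (
        id         INTEGER PRIMARY KEY,
        name       TEXT,
        address_id INTEGER REFERENCES addresses(id)
    );
    -- company and the remaining tables reference addresses the same way
    """)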
Yes, you should separate the addresses into a table of their own. It's a smart question to ask. The key here is that the general format of addresses is the same regardless of whose they are; a customer, a company, a supplier... they all have the same fields for addresses.
What makes this worthwhile is the ability to treat addresses as an atomic element; that is, you can generalize all the functionality related to addresses and have it deal with just one table, as opposed to having to worry about it dealing with several tables, and the associated schema drift that can occur.
If you are using those addresses only within the scope of their own tables, there may be no real benefit to moving them to their own tables.
Basically, it doesn't sound like it's worth the effort.
If there's an overlap between tables (i.e. the same organization is entered in both the company and supplier tables), and the address should always be the same in both tables, then it's probably worth moving the address off into its own table and having foreign keys to it from your other three tables. That way, you only have to update it in one spot when it changes.
If the three tables are entirely independent from each other, then there's not really much to gain from moving the data to another table, so you might as well leave it alone.
I think it entirely depends on the purpose of the database. Admittedly all address information is structurally the same and from a theoretical standpoint should all be in a single table linked from the parent table by a key.
However from a performance and query perspective, keeping them in their respective tables does simplify things from a reporting standpoint.
I have a situation at my current company [logistics] where the addresses are actually logically the same: they're all locations, regardless of whether they're a pickup location, delivery location, customer, etc.
In my case, I'd say that they should most definitely all be in one table. But if it's looking at it from a supplier, customer, contact information standpoint, I'd say that while theoretically it's nice to have the addresses in one table, in practice it won't buy you a whole lot as the data is unlikely to be repeated.
I disagree with Dave. The many-to-many approach (Address <-> User) is both safe, and highly advantageous.
When a customer moves, the address in the Address table does NOT change. Instead, the new address is found in the Address table, and the customer etc. is linked to that record. If the new address isn't already in the table, it's added.
So do address records themselves ever change? Yes, in cases like these:
- it turns out that the address has a typo
- the US Postal Service changes the street name
These are the very situations where putting all addresses in one table without repetition pays off; any other arrangement would require annoying and repetitive data entry.
Of course, if the database is abused, then it would be safer to avoid the many-to-many relationship. But by that token, if the database is in bad hands, it's better to just print everything out, store it in a file cabinet, and verify every transaction against the paper copy. So "protection against misuse" is not a good design principle, in my opinion.
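Here is a minimal sketch of the arrangement this answer describes, with SQLite and made-up names: each distinct address is stored once, customers are linked to it, and a move is a find-or-insert plus a new link:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        street TEXT, city TEXT,
        UNIQUE (street, city)            -- one row per distinct address
    );
    CREATE TABLE customer_addresses (
        customer_id INTEGER,
        address_id  INTEGER REFERENCES addresses(id)
    );
    """)

    def link_customer(customer_id, street, city):
        # find-or-insert the address, then point the customer at it
        db.execute("INSERT OR IGNORE INTO addresses (street, city) VALUES (?, ?)",
                   (street, city))
        (addr_id,) = db.execute(
            "SELECT id FROM addresses WHERE street = ? AND city = ?",
            (street, city)).fetchone()
        db.execute("INSERT INTO customer_addresses VALUES (?, ?)",
                   (customer_id, addr_id))

    link_customer(1, "1 Main St", "Springfield")   # customer 1 moves in

Under this shape, fixing a typo edits the one shared address row, while a customer moving never mutates an address out from under anyone else, which is the safety property argued above.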
