I have a simple, basic question. Assume I have a large website like Facebook or Gmail. Such a site probably saves hundreds of gigabytes of information every day. My question is how these sites store this much information in their databases, given database capacity limits. Is there only one database? Is there only one server for the site? If there are multiple servers and databases, how do they communicate with each other?
They are clearly not using one computer...
The systems behind such large sites are very complex and distributed across datacenters. See http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/
Take a look at this site for info on various architectures employed by those sites (and this site): http://highscalability.com/all-time-favorites/
Most of these sites have adopted a strategy called NoSQL: they don't use traditional relational databases (RDBMSs), but instead have created their own object-relationship frameworks with the ability to be persisted. This strategy works well at large scale because it drops a number of constraints that would seriously impact the performance of traditional database methods. However, this generally comes at the cost of reduced reliability, which is usually considered acceptable for those sites' scenarios.
P.S. If your question is out of general interest, then no worries. If you're trying to build a highly scalable application, hold off and consider it for a moment: are you going to serve a significant percentage of the population of the world, or are you writing a site for maybe a few thousand users? If it's the latter, you don't need Facebook-style scaling; invest your effort and resources elsewhere. If it's the former, start small and then evolve your system, bringing in investment and expertise as your user base grows.
Related
On my new job there is a web application written in Visual Basic .NET, using the ASP.NET WebForms framework to produce and render web pages.
It runs on a Windows server and requires the Microsoft IIS web server as its application host. The project is developed with Microsoft Visual Studio 2010
as the development environment and uses the InterSystems Caché database. The application has a layered architecture (interface, business, and data access layers).
We use Firefox 78.1.0esr (64-bit) as the browser (internal policy).
Users complain that they don't know when a page is loading / request is being processed.
Apparently, in the past, Firefox displayed an hourglass while a page was loading.
What is the easiest way to display an hourglass for each request (independent of the page)?
It's a very large application.
For postbacks, you could use the asp:UpdateProgress control.
See: https://learn.microsoft.com/en-us/dotnet/api/system.web.ui.updateprogress?view=netframework-4.8
It allows you to display anything you want while the postback is being processed. I assume you could also use a CSS class that turns your pointer into an hourglass if you wanted.
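A minimal sketch of how that might look (the control IDs, message text, and CSS class are placeholders; UpdateProgress requires a ScriptManager and shows its template during asynchronous postbacks from the associated UpdatePanel):

```aspx
<asp:ScriptManager ID="ScriptManager1" runat="server" />

<asp:UpdatePanel ID="MainPanel" runat="server">
    <ContentTemplate>
        <%-- existing page controls go here --%>
        <asp:Button ID="SaveButton" runat="server" Text="Save" />
    </ContentTemplate>
</asp:UpdatePanel>

<%-- Rendered automatically while the async postback is in flight --%>
<asp:UpdateProgress ID="UpdateProgress1" runat="server"
                    AssociatedUpdatePanelID="MainPanel" DisplayAfter="100">
    <ProgressTemplate>
        <div class="busy">Processing, please wait...</div>
    </ProgressTemplate>
</asp:UpdateProgress>
```

For the cursor idea, a rule like .busy { cursor: wait; } (or toggling such a class on the body element while the template is visible) would give the hourglass effect.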
Well, as noted, you don't mention which task has the performance issues.
And the browser's display (or not) of a loading indicator is a hugely different issue.
All browsers I know of still display a loading animation while a page loads, and that includes Edge (now based on the Chrome engine) and Firefox.
I am not aware of any changes in this area. However, the suggestions in some comments here to use Ajax have it exactly backwards: introducing Ajax calls will for sure remove the browser's "waiting" or spinning animation that nearly all browsers show while a page is loading, because with Ajax the page itself is no longer loading.
Next up:
Performance:
Well, does a page on the site that loads no data load fast? In other words, do you have (or can you add) a test page on the site with just "hello world" on it?
Does that page load fast, or do all pages load slowly? Or are only the pages that perform data operations slow?
And did the site run fast at one time, and is it now, with more data, running slow? Or was it always slow? You really need to answer these questions.
Since you're using a post-SQL (non-SQL) database, data in that system is actually saved in a format very similar to JSON or XML, except that the multi-value format used by the Caché database was invented around 1970, and that database stores its data as delimited strings.
In effect, it's like using Google, or even SharePoint: you can have millions and millions of documents. Such systems, even on machines with modest processors and memory, can easily run, say, a motor vehicle department for a whole country of 100 million people.
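To make the string-based storage concrete, here is a rough illustration; the delimiters and field layout are purely hypothetical (the real MultiValue encoding differs), but the idea is that a whole invoice lives in one string, the way you might otherwise use JSON or XML:

```csharp
using System;

class MultiValueSketch
{
    static void Main()
    {
        // Hypothetical multi-value record: '^' separates fields,
        // ']' separates repeating values within a field.
        // One whole invoice - header plus every line item - in one string.
        string invoice = "10422^2021-03-15^ACME^WIDGET]BOLT]NUT^5]100]200";

        string[] fields = invoice.Split('^');
        string[] items  = fields[3].Split(']');  // line-item product codes
        string[] qtys   = fields[4].Split(']');  // matching quantities

        Console.WriteLine($"Invoice {fields[0]} for {fields[2]} ({fields[1]}):");
        for (int i = 0; i < items.Length; i++)
            Console.WriteLine($"  {items[i]} x {qtys[i]}");

        // The same record as JSON would be:
        // { "id": 10422, "date": "2021-03-15", "customer": "ACME",
        //   "lines": [ { "item": "WIDGET", "qty": 5 }, ... ] }
    }
}
```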
So here is what they do amazingly well:
If you need to pull a patient record, or say a customer invoice? Such systems, even with say 1 billion records, can pull that whole invoice out of the database with one disk operation and seek. They are blisteringly fast, compared to a SQL system, when retrieving what amounts to a master/child record.
However, while those systems can represent a whole invoice as one "string" (just as you can with XML or JSON), what they do very poorly is "row" processing. Get a record, modify it, save it back? Fantastic performance.
But row processing, say using SQL statements to update a lot of rows? That is really slow. In fact, things are even worse, because the ODBC drivers are translating SQL statements on the fly into operations against the non-SQL, string-based database system.
So I would ensure the data files are sized correctly. If that system used to run fast but is now slow, the data files (their base size) need to be re-sized; that reduces the huge mass of what are called "linked frames", and performance should increase hugely. But you have to check this in the Caché database itself, and that assumes a few years of experience with that database system (years of experience with a multi-value database is a basic requirement here).
So, do all web pages load slowly, even ones without data? Then it can't be the database.
Or does this occur only for some database pages, say ones that involve updates?
And did the developers use the "data cube"-like data objects, or do they use the SQL (ODBC) translator to deal with the database?
And was the system fast at one time, or was it always slow?
And do pages with little (or no) database work run fast?
And very important:
Does the site run fast with, say, 5 users, but then run slow with 50 users?
(Then again, I can't imagine you haven't asked these super-simple questions that anyone would ask when attempting to evaluate performance. You can have "average" doctors, but when you have a special medical mystery, you need a Dr. House, the best of the best. The same goes for computers. You may be assuming the original developers of the software and system did a poor job, but maybe they did a really good job and the database or software has simply been outgrown by high server loads during peak times. And perhaps worse, these basic questions about performance just haven't been asked yet.)
In one of our computing science classes, only about 2-3 out of a class of about 80 would "naturally" write the fastest-running code. (These days, unfortunately, tracking things like code execution time and other metrics is often not even considered any more, but it should be.) The same goes for Formula 1 racing: you can take two teams, each with a 400-million-dollar budget, and one team blows away the competition, even though all the teams hire the best talent money can buy. So, hiring some developers to build software? Sure, there are lots of developers. Hiring developers with top-notch performance in mind? That is when you seek out the big guns, the gurus, the top talent.
And even other basic questions, such as:
Does the system run well on the test or development box?
Did the system run well at one time, but has it become slower over time?
Or was it always slow? (And then, what prompted the idea that performance now needs to be fixed and addressed, compared to no one doing anything about it 5 years ago?)
And when the site started becoming slow, were the developers of the site contacted? (Why, or why not?)
But a question about a cube-like "no-SQL" database, and then also a simple browser question about the loading "animation" icon that all browsers have? Those are two very different questions, and it is hard to see how both issues ended up mixed together in the same post.
All current browsers I am aware of do have a "wait"-type animation, but that animation has very little to do with optimizing application and database performance. Toss in that you are using a so-called post-relational, or multi-value, database, and you are introducing an area of expertise that most posters here likely don't have (hence the unhelpful suggestions about Ajax and the like).
I have 10+ years of experience with those multi-value databases and, as noted, they are not fast at row processing; but for pulling, updating, and saving a single record, such systems can easily beat SQL-based systems performance-wise. The fine art of performance tuning really is a task for the top talent in our industry.
So: was the system fast at one time, and is it now slow?
Are all pages slow, even those that pull no data from the database?
Or are only some pages with data operations slow?
Are the developers using Caché data objects, or the database's SQL (ODBC) provider?
In short, what type of data provider(s) is this application using?
I am looking for some light amid the complexity of architectural choices, before starting the development of a CMS, CRM, or ERP.
I was able to find this similar question: A CRM architecture (open source app)
But it seems quite old.
I have recently watched and read several conference talks and discussions about monolith vs. distributed architectures, the DDD philosophy, CQRS and event-driven design, etc.
And I panic even more than before about the architectural choice, having taken into account the flaws of each (I think).
What I find unfortunate about all the examples of microservices and distributed systems that are easy to find on the net is that they always take e-commerce as the example (customers, orders, products...). And for that kind of example, there are several databases (in general, one NoSQL database per microservice).
I see the advantage (more or less): it keeps a minimalist representation of the data each context needs.
But what if I want a single relational database? I really think I need one. Having worked at a company producing a CRM (without access to the source code, but with access to the database structure), I saw the importance of relational modeling: it is necessary for listings, reports, and consulting the links between entities within the CRM (a contact can have several companies and vice versa; each user has several actions and tasks, but each task can also be assigned to other users, or even be linked to other items such as "contact", "company", "publication", "calendarDate", etc.). And there can be a lot of records in each table (100,000+ rows), so the choice of indexes will be quite important, and transactions are omnipresent because there will be a lot of concurrent access to data records.
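To illustrate the kind of relational model I mean, here is a minimal sketch of one of those relationships (contacts to companies, many-to-many). I'm assuming EF Core since my draft is .NET Core; all names are illustrative and the provider configuration is omitted:

```csharp
using System.Collections.Generic;
using Microsoft.EntityFrameworkCore;

public class Contact
{
    public int Id { get; set; }
    public string Name { get; set; }
    public List<CompanyContact> Companies { get; set; } = new List<CompanyContact>();
}

public class Company
{
    public int Id { get; set; }
    public string Name { get; set; }
    public List<CompanyContact> Contacts { get; set; } = new List<CompanyContact>();
}

// Explicit join entity: a contact can have several companies and vice versa.
public class CompanyContact
{
    public int CompanyId { get; set; }
    public Company Company { get; set; }
    public int ContactId { get; set; }
    public Contact Contact { get; set; }
}

public class CrmContext : DbContext
{
    public DbSet<Contact> Contacts { get; set; }
    public DbSet<Company> Companies { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Composite primary key doubles as the index that carries the
        // listings and reports once the tables grow past 100,000 rows.
        modelBuilder.Entity<CompanyContact>()
            .HasKey(cc => new { cc.CompanyId, cc.ContactId });
    }
}
```

The same join-entity pattern would cover the task-to-user and task-to-item links, with everything staying in one transactional, relational store.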
What I tell myself is that if I choose a microservice system, there will be a lot of microservices to build, because there really would be a lot of different contexts, and a high probability of ending up with a bunch of different domain models. I would then feel like I have to light every little bulb of a garland, with perhaps too many processes running simultaneously.
To try to be precise and not go off in all directions, I have two questions:
Can we easily mix the DDD philosophy with a monolithic system, while decoupling only the very few services that absolutely must be set apart, for various reasons?
If so, could you point me to resources where I can learn much more about this?
Do we necessarily have to work with a multitude of databases, and must they necessarily be of the MongoDB/NoSQL kind?
I imagine the answer is no, but could you elaborate a little, or redirect me to articles that give clear enough answers?
Thank you in advance!
(It would be .NET Core, draft is here: https://github.com/Jin-K/simple-cms)
DDD works perfectly as an approach to designing your CRM. I used it in my last project (a web-based CRM) and it was exactly what I needed. As a matter of fact, if I hadn't used DDD, the project would have been impossible to manage. The CRM I created (as the only architect and developer) was very complex and very custom. It integrates with many external systems (e.g. an email server and a phone-call system).
The first thing you should do is discover the main parts of your system. This is the hardest part, and you will probably get them wrong the first time. The good thing is that this is an iterative process, and it should stabilize before the system reaches production, because after that refactoring is harder (i.e. you need to migrate data, and that is painful). These main parts are called Bounded Contexts (BCs) in DDD.
For each BC I created a module. I didn't need microservices; a modular monolith was just perfect. I used Conway's Law to discover the BCs: I noticed that every department had common needs from the CRM, but also needs of its own.
There were some generic BCs that were common to each department, like email receiving/sending, customer activity recording, task scheduling, notifications. The behavior was almost the same for all departments.
The department-specific BCs had very different behavior for similar concepts. For example, the Sales department and the Data processing department had different requirements for a Contract, so I created two Aggregates named Contract that shared the same ID but had different data and behavior. To keep them "synchronized" I used a Saga/Process manager. For example, when a Contract was activated (manually or after the first payment), a DataProcessingDocument was created, containing data based on the contract's content.
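A rough sketch of that process manager (all type names here are mine, not from a framework): it reacts to the Sales-side activation event and creates the Data-processing-side document, idempotently, since the same event may be delivered more than once.

```csharp
using System;

// Event published by the Sales BC when a contract is activated.
public class ContractActivated
{
    public Guid ContractId { get; set; }
    public string ContractContent { get; set; }
}

public class DataProcessingDocument
{
    public Guid ContractId { get; private set; }
    public string Content { get; private set; }

    public static DataProcessingDocument CreateFrom(Guid contractId, string content)
        => new DataProcessingDocument { ContractId = contractId, Content = content };
}

public interface IDataProcessingDocumentRepository
{
    bool ExistsFor(Guid contractId);
    void Add(DataProcessingDocument document);
}

// The Saga/Process manager keeping the two Contract aggregates "synchronized".
public class ContractActivationProcessManager
{
    private readonly IDataProcessingDocumentRepository _documents;

    public ContractActivationProcessManager(IDataProcessingDocumentRepository documents)
        => _documents = documents;

    // Subscribed to the event bus. Idempotent: a redelivered event is ignored.
    public void Handle(ContractActivated e)
    {
        if (_documents.ExistsFor(e.ContractId)) return;
        _documents.Add(DataProcessingDocument.CreateFrom(e.ContractId, e.ContractContent));
    }
}
```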
Another important point of view is to discover and respect the sources of truth. For example, the source of truth for received emails is the Email Server. The CRM should reflect this in its UI; it should be very clear that the CRM is only a delayed reflection of what is happening on the Email Server, and there may be received emails that are not shown in the CRM for technical reasons.
The source of truth for draft emails is the CRM, with its Email composer module. If a draft is no longer shown, it means it has been deleted by a CRM user.
When the CRM is not the source of truth, the code should have little or no behavior and the data should be mostly immutable. Here you could use CRUD, unless you have performance problems (e.g. millions of entries), in which case you could use CQRS.
And there can be a lot of records in each table (100,000+ rows), so the choice of indexes will be quite important, and transactions are omnipresent because there will be a lot of concurrent access to data records.
CQRS helped me a lot in building a performant, responsive system. You don't have to use it for every module, just where you have a lot of data and/or different behavior for writes and reads. For example, for recording activity with customers, I used CQRS to get performant listings (so there I used CQRS purely for performance reasons).
I also used CQRS where I had a lot of different views/projections/interpretations of the same events.
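As an illustration of that activity-recording case, here is a minimal CQRS-style sketch (all names are illustrative): the write side records events, and a projection maintains a flat read model so the listing pages never join or aggregate at query time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Write side: an event recorded whenever an agent contacts a customer.
public class CustomerContacted
{
    public Guid CustomerId { get; set; }
    public DateTime At { get; set; }
    public string Channel { get; set; }  // "email", "phone", ...
}

// Read side: one pre-computed row per customer, ideal for listings.
public class CustomerActivityRow
{
    public Guid CustomerId { get; set; }
    public int TotalContacts { get; set; }
    public DateTime LastContactAt { get; set; }
}

// Projection: consumes write-side events and updates the read model.
// In production the rows would live in their own store, indexed for reads.
public class CustomerActivityProjection
{
    private readonly Dictionary<Guid, CustomerActivityRow> _rows =
        new Dictionary<Guid, CustomerActivityRow>();

    public void When(CustomerContacted e)
    {
        if (!_rows.TryGetValue(e.CustomerId, out var row))
            _rows[e.CustomerId] = row =
                new CustomerActivityRow { CustomerId = e.CustomerId };

        row.TotalContacts++;
        if (e.At > row.LastContactAt) row.LastContactAt = e.At;
    }

    // What a listing page runs: no joins, no aggregation at read time.
    public IReadOnlyList<CustomerActivityRow> MostActive(int count) =>
        _rows.Values.OrderByDescending(r => r.TotalContacts).Take(count).ToList();
}
```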
Do we necessarily have to work with a multitude of databases, and must they necessarily be of the MongoDB/NoSQL kind? I imagine the answer is no, but could you elaborate a little, or redirect me to articles that give clear enough answers?
Of course not. Use whatever works. I used MongoDB in 95% of cases and MySQL only for the Search module. It was easier to manage a single database system, and the performance/scalability/availability was good enough.
I hope these thoughts help you. Good luck!
I have what I consider to be a fairly simple application. A service returns some data based on another piece of data. A simple example: given a state name, the service returns the capital city.
All the data resides in a SQL Server 2008 database. The majority of this "static" data will rarely change. It will occasionally need to be updated and, when it does, I have no problem restarting the application to refresh the cache, if one is implemented.
Some data, which is more "dynamic", will be kept in the same database. This data includes contacts, statistics, etc. and will change more frequently (anywhere from hourly to daily to weekly). This data will be linked to the static data above via foreign keys (just like a SQL JOIN).
My question is: what exactly am I trying to implement here, and how do I get started? I know the static data should be cached, but I don't know where to start. I tried searching, but came up with so much material that I'm not sure where to begin. Recommendations for tutorials would also be appreciated.
You don't need to cache anything until you have a performance problem. Only when you have a noticeable problem, and have measured your application tiers to determine that the database is in fact the bottleneck (which it rarely is), should you start looking into caching data. It is always a tradeoff: memory vs. CPU vs. real-time data availability. There is no reason to make your application more complicated than it needs to be just because you can.
An extremely simple 'win' here (I assume you're using WCF here) would be to use the declarative attribute-based caching mechanism built into the framework. It's easy to set up and manage, but you need to analyze your usage scenarios to make sure it's applied at the right locations to really benefit from it. This article is a good starting point.
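To sketch what that looks like for the state/capital lookup (the service and profile names here are made up): the AspNetCacheProfile attribute applies to WCF REST-style operations running with ASP.NET compatibility enabled, and the named cache profile itself is defined under system.web/caching/outputCacheSettings in web.config.

```csharp
using System.ServiceModel;
using System.ServiceModel.Activation;
using System.ServiceModel.Web;

[ServiceContract]
public interface IStateService
{
    [OperationContract]
    [WebGet(UriTemplate = "capital/{state}")]
    string GetCapital(string state);
}

// Requires <serviceHostingEnvironment aspNetCompatibilityEnabled="true"/>
// in web.config so WCF can participate in the ASP.NET output cache.
[AspNetCompatibilityRequirements(
    RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)]
public class StateService : IStateService
{
    // Responses are cached per the "StaticLookups" profile in web.config;
    // the database is only hit on a cache miss.
    [AspNetCacheProfile("StaticLookups")]
    public string GetCapital(string state)
    {
        return LookupCapitalFromDatabase(state);
    }

    private string LookupCapitalFromDatabase(string state)
    {
        // Placeholder for the real SQL Server lookup.
        return state == "California" ? "Sacramento" : "(unknown)";
    }
}
```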
Beyond that, I'd recommend looking into one of the many WCF books that deal with higher-level concepts like caching and try to figure out if their implementation patterns are applicable to your design.
While this question has been asked in a variety of contexts before, I can't find any information pertaining specifically to sites targeting very large audiences - for example on the scale of hundreds of thousands or even millions of users.
When writing sites that target smaller audiences (such as intranet hosted data driven sites that handle from a few to a few thousand users) we only tend to follow best practices within the confines of our project budgets/deadlines - i.e. developer costs, rollout schedules and maintainability have a far bigger impact than we would often like on how we code things.
Some things are also negligible (to a point), for instance delivery time, image compression/size, and bandwidth, because the nature of a LAN-hosted application tends to mean the financial cost involved is small enough that (within reason) we don't need to worry about it too much.
However, when looking to target a much broader audience for instance an audience of (hopefully) millions of users:
Are there any best practices that no longer need to be worried about (i.e. become more negligible the larger the audience)?
Are there any practices that should be adhered to even more tightly?
Also, are there any practices that only really come into play as your audience achieves some critical mass [and what would that critical mass be]? i.e. applying artificial constraints that wouldn't begin to concern you on a private network
Examples I've come across so far are:
Host libraries such as jQuery on Google, as they are delivered from Google's CDN and can be served much faster than from your own servers. This also helps keep your bandwidth costs down (see the snippet after this list).
Host images on a CDN, for the same reason as hosting your JavaScript code elsewhere.
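For the jQuery case, the usual pattern I've seen is to reference the CDN copy and fall back to a local copy if the CDN is unreachable (the version number and local path below are placeholders):

```html
<!-- Served from Google's CDN: fast, and likely already in the user's cache -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
<!-- Fall back to our own copy if the CDN copy failed to load -->
<script>
  window.jQuery || document.write(
    '<script src="/scripts/jquery-3.6.0.min.js"><\/script>');
</script>
```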
I guess it depends on what one aims for on the "triangle" of pressures: CAP (Consistency, Availability, and Partition tolerance). E.g. one can only have so much "C" when faced with the network disruptions that bring "P" into play.
Nowadays, the accent appears to be on delivering a "good user experience", which seems to hinge on "time to result" (e.g. having a complete web page on the user's desktop): this translates to investing (among other things) more on the "A" and "P" sides than on the "C" one.
More concretely: spend some time deciding when to perform data aggregation for the presentation layer shown to your users, e.g. can I aggregate this data over a longer time period before recomputing another view to push?
Of course, I am only barely scratching the surface of the problem.
I think there are three big things to keep in mind here:
a) You aren't going to write the next Twitter/YouTube/Facebook/eBay/Amazon/whatever. It doesn't happen often, so it's a big case of YAGNI.
b) If you do happen to write one of those, chances are you'll have the opportunity to rewrite the application more than a few times.
c) The one object lesson from the architects who have spoken publicly about those apps is that scaling horizontally is the way to go. Vertical scaling maxes out very, very quickly.
Also, I'd argue that process improvements matter much more at these lofty scales. You will have legions of developers, strict deployment windows, and lots of boxes to worry about. It had better all be thoroughly scripted, automated, and repeatable.
I would check out YSlow and follow its recommendations for improving performance.
#jldupont - I just looked at the presentation you linked to. One thing I didn't get is why "Distributed Databases" is given as an example scenario where you lose Availability to gain Consistency and Partition tolerance.
I think that for distributed databases you lose Consistency.
I've always been of the opinion an internal development group should really only be building/maintaining three applications.
An internal composite/pluggable/extendable application.
The company website.
(Optional) A mobile version of #1 for field employees.
I'm a consultant, and everywhere I go, my clients have dozens of one-off applications, on the web and on the desktop, for every need, no matter how related it is to the others. Someone comes to IT and says "I need this", and the IT developers turn around and write another one-off ASP.NET application, or another WinForms app.
What's your opinion? Should I embrace the "as many apps as we want/need" movement? I assume it's common; but is it sensible?
EDIT:
A colleague pointed out that it depends on the focus of the development - are you making apps or are you making a system? I guess to me, internal development is about making a system; development of shippable software products, like MS Word, iTunes, and Photoshop, is about making apps.
All of them?
Wow do I ever agree with you. The problem is that many one-off applications will (at some point) each have many one-off maintenance requests. Anything from business rule updates to requests for new reports. At some point the ratio of apps that need to be maintained to available development staff is going to be stretched/taxed.
From my (perhaps limited) vantage point, I'm starting to think #1 and #3 could be boiled down to SharePoint. Most one-off applications where I work (a large 500+ attorney law firm) consist of one or more of the following:
A wiki
A blog
Some sort of list (or lists joined together in some type of relationship), which can be sorted and arranged in different ways.
A report (either a SharePoint data view or a SQL Server report works just fine)
Or, the user just wants to "make a web page" and add content to it. But only they should be able to edit it. Except when they're out of the office, and then, etc...
Try to build any one of the above using [name your technology], and you've got lots of maintenance cycles to look forward to (versus a relatively minor SharePoint change).
If I could restate what I think is your point: why not put most of your dev cycles to work improving and maintaining a single application that can support most of your business' one-off needs, rather than cranking out an unending stream of smallish speciality apps?
This question depends on so many things, and is subjective besides. I've worked at companies that needed several different apps because we did business in discrete silos. In that case, an internal group may not both build and maintain apps; it may build several, with another group responsible for maintenance.
Also, what do you mean by "app"? If you broaden the term enough, then you could say "it's all just one big app".
In short, I think the main consideration is the capacity of the group and what business needs are.
I think there should be internal development teams, each of which owns a system that may contain multiple applications. To take a few examples of what I mean by systems:
ERP - If you are a manufacturer of products, you may need a system to keep track of inventory, the accounting of books and money, and other planning elements. Such systems come in a wide range of scales, but I suspect that in most cases some customization is done, and that is where a team comes in; it may end up doing just that over and over if the company is successful and a new system is eventually needed to replace the previous one, as these can take years to get fully up and running. The application for the shop floor is likely not the same one the CFO needs in order to write the quarterly earnings numbers, to give two examples.
CRM - How about tracking all customer interactions within an organization, which can be useful for the sales and marketing departments? Again, there are many different solutions, and generally some customization is done, which is work for another team. The sales team may want one view of the data, but if the company has a support arm, it may want different data about a customer to help it do its job.
CMS - Now, here I can see your three applications making sense, but note what else there is beyond simple content.
I don't think I'd want to work where everything is a home grown solution and there is no outside code used at all. Lots of code out there can be used in rather good ways such as tools but also components like DB servers or development IDEs.
So what's the alternative to several one-off applications? One super-huge application that runs anything and everything? That seems even worse to me...