Can someone please explain how concurrency affects a web application?
Anything helps!
Thanks.
This is a super broad question, but the common sources of concurrency errors, from the base of the app up, are:
DB Level (MVCC)
Runtime Level (programming language, concurrency model, threads, locks, race conditions, etc.)
Web server level (Multiple client requests on the same endpoints at the same time)
Logical: CQRS (Eventual Consistency, Writes and reads are separate, reads can be stale/lagged)
Logical: Distributed transactions (Imagine having a read cache like redis and a store like mysql, how do you guarantee these are in sync?)
I can't recommend Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems highly enough; it provides a foundation in all the important concepts you're concerned about!
A concrete example using CQRS:
There is a DB store
There is a single Writer
There is a single Reader
A transaction makes a request to persist information through the writer and then immediately reads back from the reader. Depending on your CQRS implementation, it's possible there are only eventually consistent guarantees between writes and reads, meaning the read may not see the just-written data! Of course, this may or may not be an issue depending on your client.
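To make that read-after-write gap concrete, here is a minimal C# sketch; the ICustomerWriter/ICustomerReader interfaces, the SignupService class, and the retry loop are all hypothetical, just to illustrate the point:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical split interfaces: writes go to one store, reads come from another
// (e.g. a primary and an asynchronously updated read replica / projection).
public interface ICustomerWriter { Task SaveAsync(Guid id, string name); }
public interface ICustomerReader { Task<string> FindNameAsync(Guid id); }

public class SignupService
{
    private readonly ICustomerWriter _writer;
    private readonly ICustomerReader _reader;

    public SignupService(ICustomerWriter writer, ICustomerReader reader)
    {
        _writer = writer;
        _reader = reader;
    }

    public async Task<string> CreateAndReadBackAsync(Guid id, string name)
    {
        await _writer.SaveAsync(id, name);

        // Under eventual consistency the read side may not have applied the
        // write yet, so this can legitimately return null right after the save.
        var readBack = await _reader.FindNameAsync(id);

        // One common mitigation: retry/poll briefly before giving up.
        for (var attempt = 0; readBack == null && attempt < 3; attempt++)
        {
            await Task.Delay(TimeSpan.FromMilliseconds(100));
            readBack = await _reader.FindNameAsync(id);
        }
        return readBack;
    }
}
```

The retry loop is only one mitigation; alternatives include routing read-your-own-writes queries to the write side, or designing the client so it does not need to read back immediately.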
Meteor's DDP protocol works very well for syncing a small collection of data from a server to a browser-based client, which inherently limits the amount of data that is processed.
However, consider a situation where Meteor is being used to sync a large collection from one server to another, or just the DDP protocol itself is used to sync one MongoDB with another.
How efficient is DDP in this case (computationally)? How well does it scale to several clients? Is the limit to performance only bandwidth or will DDP hit some CPU bound as well? What is the largest amount of data that can be reasonably synced over DDP right now? Is DDP just the wrong approach for doing this (see references below)?
Some additional thoughts:
As far as I know, the current version of DDP keeps track of each client's entire collection, so it can't be asymptotically very efficient.
Smart Collections were created to improve the performance of server-to-client collection syncing. But it's unclear to me whether this improves DDP itself or something else.
See also:
How to implement real-time replication of MongoDB (or CouchDB) to many remote clients
DDP vs Straight MongoDB access for synching large amounts of data
EDIT:
After some empirical experience with this, I have to conclude that the answer is "not very efficient". See https://stackoverflow.com/a/21835534/586086 for an explanation.
Discussions with Meteor devs indicated that this problem will be addressed in the future with a revision of DDP and the publish-subscribe API, whereby the merge box will be removed and clients will handle merging. This will save CPU/memory on the server and allow for much larger datasets to be sent over the wire.
Basically, it is more a matter of what and how you are publishing to the client than of the number of clients. A request is usually handled in O(log N) time if indexed, so it is quite easy for the server to recompute the result set even if (in the worst case) the whole collection changes. So, from the server side, you can get the new result sets to publish to the clients quite quickly (if they changed from the ones the clients already had).
The real problem (and a common error) comes when you publish everything to the client (as with the former autopublish), so craft your publications wisely so that you only send what the client is supposed to see. You can either prune the documents by hiding unneeded fields, or reduce the result set sent to the client by creating a publication whose parameters are specific to your data's scope of use.
Data reactivity (a session parameter bound to a publication) should also be handled with care. If, for example, you send a request each time a key is pressed in a search field, you can quickly overload the connection (still depending strongly on the size of the set you are publishing). We had to deal with this while building a real estate service over Meteor; with a data set of several gigabytes, it was quite challenging to handle this without blocking the pipe with excess data.
In terms of bandwidth, DDP is quite good because it supports clever entry updates (sending only the changed fields instead of the whole document), and moving an item is (or will be) supported too (server-side reordering).
Also take a look at this excellent answer concerning huge collections and what is done under the hood.
I'm about to begin designing the architecture of a personal project that has the following characteristics:
Essentially a "game" containing several concurrent users based on a sport.
Matches in this sport are simulated on a regular basis and their results stored in a database.
Users can view the details of a simulated match "live" when it is occurring as well as see results after they have occurred.
I developed a similar web application with a much smaller scope as the previous iteration of this project. In that case, however, I chose to go with SQLite as my DB provider since I also had a redistributable desktop application that could be used to manually simulate matches (and in fact that ran as a standalone simulator outside of the web application). My constraints have now shifted to be only a web application, so I don't have to worry about this additional level of complexity.
My main problem with my previous implementation was handling concurrent requests. I made the mistake of using one database (which was represented by a single file on disk) to power both the simulation aspect (which ran in a separate process on the server) and the web application. Hence, when users were accessing the website concurrently with a live simulation happening, there were all sorts of database access issues since it was getting locked by one process. I fixed this by implementing a cross-process mutex on database operations but this drastically slowed down the performance of the website.
The tools I will be using are:
ASP.NET for the web application.
SQL Server 2008 R2 for the database... probably with an NHibernate layer for object relational mapping.
My question is: how do I design this to achieve optimal efficiency as well as concurrent access? Obviously, shifting from a file to an actual DB server will have its positives, but do I need to have two redundant servers: one for the simulation process and one for the web server process?
Any suggestions would be appreciated!
Thanks.
You should be fine doing both on the same database. Concurrent access is what modern database engines are designed for. Concurrent reads are usually no problem at all; concurrent writes lock the minimum possible amount of data (a table, or even just a number of rows), not the entire database.
A few things you should keep in mind though:
Use transactions wisely. On the one hand, a transaction is an important tool in making sure your database is always consistent - in short, a transaction either happens completely, or not at all. On the other hand, two concurrent transactions can cause deadlocks, and those buggers can be extremely hard to debug.
Normalize, and use constraints to protect your data integrity. Enforcing foreign keys can save the day, even though it often leads to more cumbersome administration.
Minimize the amount of time spent on data access: don't keep connections around when you don't need them, make absolutely sure you're not leaking any connections, don't fetch data you know you don't need, and do as much data-related processing as possible (especially things that can be solved using joins, subqueries, groupings, views, etc.) in SQL instead of in code.
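To illustrate the transaction and connection-handling points above, here is a minimal sketch using plain ADO.NET (the table, column, and method names are made up; the same principles apply if you go through NHibernate): open the connection late, keep the transaction small, and let the using blocks return the connection to the pool even on failure.

```csharp
using System.Data.SqlClient;

public static class MatchResultWriter
{
    // Open late, close early: the connection lives only for the duration of the write.
    public static void SaveResult(string connectionString, int matchId, int homeGoals, int awayGoals)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                using (var command = connection.CreateCommand())
                {
                    command.Transaction = transaction;
                    command.CommandText =
                        @"UPDATE MatchResults
                          SET HomeGoals = @home, AwayGoals = @away
                          WHERE MatchId = @id";
                    command.Parameters.AddWithValue("@home", homeGoals);
                    command.Parameters.AddWithValue("@away", awayGoals);
                    command.Parameters.AddWithValue("@id", matchId);
                    command.ExecuteNonQuery();
                }
                // Commit quickly; long-lived transactions are what turn
                // concurrent writers into deadlock candidates.
                transaction.Commit();
            }
        } // Dispose returns the connection to the pool even if an exception is thrown.
    }
}
```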
Background:
Enterprise application - very well written for its time (2004).
Stack:
.NET, Heavy use of Remoting, ASMX style web services, SQL Server
Problem:
The application allows users to go through various "wizards", for lack of a better term. All of their actions are stored in what we call "wiz state", which is essentially XML that is persisted to a SQL Server database very frequently, because we allow users to pause/resume their application. Often in these wizards the XML that comprises the wizard state grows very large, I'm talking 5-8 MB of data, and we noticed that when we had a sudden influx of simultaneous users, we started receiving occasional timeouts against the database, because a lot of what the wizard state comprises is keeping track of collections of "things". Sometimes these custom collections grow very large.
Question:
We were in a meeting today and we're expecting a flurry of activity in October that will test the system like never before, and possibly result in huge wizard states that go back and forth from the web server to the database. The crux of the situation is that there is only one database and one web server.
For argument's sake, because of the complexity of the application, let's say adding any kind of clustering/mirroring to increase database throughput is out of the question. I spoke up in the meeting and said the quickest way to address this in the shortest time period would be to add more servers to the front-end web application so the load could be distributed amongst web servers. The development lead said I was completely wrong and that it would have no effect because we only have one database, so adding more web power would do nothing. He is having one of the other developers reduce the XML bloat that we persist frequently to the database. In the long run, reducing the size of the XML that we pass back and forth is probably the right idea, but will adding additional web servers truly have no effect? In terms of simultaneous users, I think it should help.
Any responses or thoughts are appreciated; proof that more web servers would help would be a pure win.
Thanks.
EDIT: We use binary serialization to store the XML in the database in an image field.
I haven't heard anything about locating the "bottlenecks". Isn't that the first thing to do? Here's the method I use.
Otherwise you're just investing in guesses. That won't work.
I've been in meetings like that, where everybody gets excited throwing ideas around, and "management" wants to make "decisions", but it's the blind leading the blind. Knuckle down and find out what's going on. You can't do that in meetings.
Some time ago I looked at a performance problem with some similarity to yours. The biggest "bottleneck" was in writing and parsing XML, with attendant memory allocation, setup, and destruction. Then there were others as well. You might find the same thing, or something different.
P.S. I keep quoting "bottleneck" because all the performance problems I've found have been nothing at all like the necks of bottles. Rather they are like way over-bushy call trees that need radical pruning, such as making and reading mountains of XML for no good reason.
If the rate at which the data is written by SQL is the bottleneck, feeding data to SQL more quickly should have no effect.
I am not sure exactly what the data structure is, but perhaps compressing the XML data on the web server(s) before writing may have a positive effect.
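As a hedged sketch of that idea (GZipStream is part of the .NET framework; the surrounding class and method names here are invented), compressing the wizard-state XML before it is written might look roughly like this:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class WizardStateCompressor
{
    // Compress the wizard-state XML into a byte[] suitable for a varbinary/image column.
    public static byte[] Compress(string wizardStateXml)
    {
        var raw = Encoding.UTF8.GetBytes(wizardStateXml);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            } // the GZipStream must be closed before the buffer is complete
            return output.ToArray(); // ToArray works even after the stream is closed
        }
    }

    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Highly repetitive XML usually compresses very well, so this trades some CPU on the web tier for much less data crossing the wire and hitting the database.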
If the bottleneck is the database, then more web servers will not help you much.
The problem may not only be the size of the data, but also the number of concurrent requests to the same table. The number of writes will be the big problem. If your XML write is in a transaction with other queries, you may try to break the XML write out of that transaction to reduce the locking time on the XML table.
As stated by vdeych you may try compression to reduce the data size. (That would increase the load on the web servers.)
You may also try caching the data. Only read from the SQL server if the data is not already in the cache. Make sure you don't update the SQL server if your data has not changed.
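A minimal cache-aside sketch of that suggestion, wrapping the standard ASP.NET cache (the key scheme, the ten-minute expiry, and the loader/saver delegates are made up for illustration):

```csharp
using System;
using System.Web;
using System.Web.Caching;

public static class WizardStateCache
{
    // Read-through: check the cache first and only hit SQL Server on a miss.
    public static string GetWizardState(Guid wizardId, Func<Guid, string> loadFromDatabase)
    {
        var key = "wizstate:" + wizardId;
        var cached = HttpRuntime.Cache.Get(key) as string;
        if (cached != null)
            return cached;

        var fresh = loadFromDatabase(wizardId);
        HttpRuntime.Cache.Insert(key, fresh, null,
            DateTime.UtcNow.AddMinutes(10), Cache.NoSlidingExpiration);
        return fresh;
    }

    // Skip the write entirely if the state has not actually changed.
    public static void SaveWizardState(Guid wizardId, string newState, Action<Guid, string> saveToDatabase)
    {
        var key = "wizstate:" + wizardId;
        var cached = HttpRuntime.Cache.Get(key) as string;
        if (cached != null && cached == newState)
            return; // unchanged: avoid the expensive round-trip to the database

        saveToDatabase(wizardId, newState);
        HttpRuntime.Cache.Insert(key, newState, null,
            DateTime.UtcNow.AddMinutes(10), Cache.NoSlidingExpiration);
    }
}
```

Note that this cache is per web server; if more front-end servers are added, it would need short expiries, explicit invalidation, or a distributed cache to stay correct.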
No one seems to have suggested this: what about replacing the XML serialization of your wizard state with JSON serialization?
This should give you a minor boost in performance for the serialization itself, since both the DataContractSerializer (faster) and Newtonsoft Json.NET (fastest) outperform the XML serializers in .NET. It should also easily reduce the size of your serialized object graph by 50% or more (depending on the ratio of properties to large strings in the XML).
This should dramatically lower the IO inflicted upon SQL Server. It should also limit the scope of changes required to your application (assuming it's well designed and works through common calls for serialization/deserialization).
If you choose to go this route, also invest time comparing BSON vs. JSON, as the binary encoding is likely to offer even more space savings (and further IO reduction) given the size of your object graphs.
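A small sketch of how such a comparison could be run with Json.NET; the WizardState type is a stand-in for the real state object, and BsonDataWriter is assumed to come from the separate Newtonsoft.Json.Bson package (older Json.NET versions expose a similar BsonWriter):

```csharp
using System;
using System.IO;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Bson;

// Stand-in for the real wizard-state object graph.
public class WizardState
{
    public string CurrentStep { get; set; }
    public string[] Selections { get; set; }
}

public static class WizardStateSizeCheck
{
    public static void CompareSizes(WizardState state)
    {
        // Plain JSON via Json.NET.
        var json = JsonConvert.SerializeObject(state);
        var jsonBytes = Encoding.UTF8.GetBytes(json);

        // BSON via the Newtonsoft BSON writer.
        var bsonStream = new MemoryStream();
        using (var writer = new BsonDataWriter(bsonStream))
        {
            new JsonSerializer().Serialize(writer, state);
        }
        var bsonBytes = bsonStream.ToArray(); // ToArray works even after the writer closed the stream

        Console.WriteLine("JSON: {0} bytes, BSON: {1} bytes", jsonBytes.Length, bsonBytes.Length);
    }
}
```

Measuring with a realistic 5-8 MB state object, rather than a toy graph like this one, is what will actually tell you which encoding wins for your data.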
I'm not a .NET expert but maybe using a binary serialization would increase throughput. Making sure that the XML isn't stored as text (fairly obvious but thought I'd mention it). Also relational databases are best for storing relational data, so perhaps substituting an ORM layer in place of the serialization (sounds feasible) could speed things up.
Mike is spot on: without understanding the resource constraint leading to the performance issues, no amount of discussion will resolve the problem. I'll add that socket timeouts that affect running statements are a symptom, and are never imposed by SQL Server itself; they're an artifact of your driver configuration or of a firewall or similar device between app and db imposing them (unless you're talking about timeouts for new connections, in which case you have a host in serious distress under load).
Given that your symptom is database timeouts, you need to start there. If they're indicative of long-running statements that result in a socket timeout, use SQL Server Profiler to capture the workload while simultaneously monitoring system resources. Given that it's a mature application and the type of workload you mention, it's unlikely to be statement-tuning related; it probably boils down to resource limitations: CPU, memory, or disk IO capacity.
This Technet guide is a very good place to start:
http://technet.microsoft.com/en-us/library/cc966540.aspx
If it's resource contention, then it's a simple discussion about how the resource contention can be tuned, configured for or addressed by adding more of whatever is needed.
Edit: I should add that, given a database performance issue, more application servers are likely to worsen the problem, as you increase the amount of concurrency that might otherwise be kept in check by connection pooling, request processing, or other limits.
Our client follows SOA principles and has designed web services that are very fine-grained, like createCustomer, deleteCustomer, etc.
I am not sure fine-grained services are desirable, as they create transaction-related issues. For example, suppose a business requirement is that every Customer must have an Address when it is created. In this case, the presentation component will invoke createCustomer first and then createAddress. The services internally use plain JDBC to update the respective tables in the database. Since each service is invoked by an external component, there is no way of fulfilling the transactional requirement here, i.e. if createAddress fails, the createCustomer operation must be rolled back.
I guess one approach to deal with this is either to design coarse-grained services (that create a Customer and its associated Address in one single JDBC transaction), or
perhaps simply to create a reversing service (deleteCustomer) that reverses the action of createCustomer.
Any suggestions? Thanks.
The short answer: services should be designed for the convenience of the service client. If the client is told "call this, then don't forget to call that", you're making their lives too difficult. There should be a coarse-grained service.
A long answer: Can a Customer reasonably be entered with no Address? So we call
createCustomer( stuff but no address)
and the result is a valid (if maybe not ideal) state for a customer. Later we call
changeCustomerAddress ( customerId, Address)
and now the persisted customer is more useful.
In this scenario the API is just fine. The key point is that the system's integrity does not depend upon the client code "remembering" to do something, in this case to add the address. However, more likely we don't want a customer in the system without an address in which case I see it as the service's responsibility to ensure that this happens, and to give the caller the fewest possibilities of getting it wrong.
I would see a coarse-grained createCompleteCustomer() method as by far the best way to go - this allows the service provider to solve the problem once rather than requiring every client programmer to implement the logic.
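The question mentions that the services use plain JDBC on the Java side; purely to illustrate the shape of such a coarse-grained operation, here is a sketch in C#/ADO.NET with invented table and column names. The point is simply that both inserts live or die inside one transaction owned by the service:

```csharp
using System.Data.SqlClient;

public class CustomerService
{
    private readonly string _connectionString;
    public CustomerService(string connectionString) { _connectionString = connectionString; }

    // Coarse-grained operation: the caller cannot create a customer without an address,
    // and cannot be left with a half-created customer if the address insert fails.
    public void CreateCompleteCustomer(string name, string street, string city)
    {
        using (var connection = new SqlConnection(_connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            {
                try
                {
                    int customerId;
                    using (var insertCustomer = connection.CreateCommand())
                    {
                        insertCustomer.Transaction = transaction;
                        insertCustomer.CommandText =
                            "INSERT INTO Customer (Name) OUTPUT INSERTED.CustomerId VALUES (@name)";
                        insertCustomer.Parameters.AddWithValue("@name", name);
                        customerId = (int)insertCustomer.ExecuteScalar();
                    }

                    using (var insertAddress = connection.CreateCommand())
                    {
                        insertAddress.Transaction = transaction;
                        insertAddress.CommandText =
                            "INSERT INTO Address (CustomerId, Street, City) VALUES (@id, @street, @city)";
                        insertAddress.Parameters.AddWithValue("@id", customerId);
                        insertAddress.Parameters.AddWithValue("@street", street);
                        insertAddress.Parameters.AddWithValue("@city", city);
                        insertAddress.ExecuteNonQuery();
                    }

                    transaction.Commit();   // both rows, or neither
                }
                catch
                {
                    transaction.Rollback();
                    throw;
                }
            }
        }
    }
}
```

The client now has exactly one call to make and no way to leave a Customer in the system without an Address.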
Alternatives:
a) There are Web Services specs for Atomic Transactions, and major vendors do support these specs. In principle you could implement this using fine-grained methods and true transactions. In practice, I think you enter a world of complexity when you go down this route.
b) A stateful interface (work, work, commit), as mentioned by @mtreit. Generally speaking, statefulness either adds complexity or obstructs scalability. Where does the service hold the intermediate state? If in memory, then we require affinity to a particular service instance and hence introduce scaling and reliability problems. If in some state or work-in-progress database, then we take on significant additional implementation complexity.
OK, let's start:
Our client follows SOA principles and has designed web services that are very fine-grained, like createCustomer, deleteCustomer, etc.
No, the client has forgotten to follow the SOA principles and has put up what most people do: a morass of badly defined interfaces. Following SOA principles, the client would have gone with a coarser interface (such as, for example, the OData mechanism to update data) or followed the advice of any book on multi-tiered architecture written in the last 25 years. SOA is just another word for what was invented with CORBA, and all the mistakes SOA dudes make today were basically well-known design stupidities 10 years ago with CORBA. Not that any of the people doing SOA today have ever heard of CORBA.
I am not sure if fine-grained services are desirable as they create transaction-related issues.
Only for users and platforms not supporting web services. Seriously. Naturally you get transactional issues if you ignore transactional issues in your programming. The trick here is that the people further up the food chain did not; your client just decided to ignore common knowledge (again, see my first remark on CORBA).
The people designing web services were well aware of transactional issues, which is why the web service specifications (WS-*) actually contain mechanisms for handling transactional integrity by moving commit operations up to the client calling the web service. The particular spec your client and you should read is WS-AtomicTransaction.
If you use the current technology to expose your web service (i.e. WCF on the MS platform; similar technologies exist in the Java world) then you can expose transaction flow information to the client and let the client handle transaction demarcation. This has its own share of problems - like clients keeping transactions open maliciously - but it is still pretty much the only way to handle transactions that are defined in the client.
As you give no platform and just mention Java, I am pointing you to an MS example of how that can look:
http://msdn.microsoft.com/en-us/library/ms752261.aspx
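As a rough sketch of what that can look like on the MS side (the WCF attributes shown are the standard ones; the service shape, the binding configuration with transaction flow enabled, and the data access code are assumed or omitted):

```csharp
using System.ServiceModel;
using System.Transactions;

// Service contract that allows the caller's transaction to flow into the operations.
[ServiceContract]
public interface ICustomerService
{
    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void CreateCustomer(string name);

    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void CreateAddress(string customerName, string street, string city);
}

public class CustomerService : ICustomerService
{
    // Each operation enlists in the flowed transaction and votes to commit on success.
    [OperationBehavior(TransactionScopeRequired = true, TransactionAutoComplete = true)]
    public void CreateCustomer(string name) { /* data access omitted */ }

    [OperationBehavior(TransactionScopeRequired = true, TransactionAutoComplete = true)]
    public void CreateAddress(string customerName, string street, string city) { /* data access omitted */ }
}

// Client side: both calls succeed or fail together because the client owns the transaction.
public static class ClientDemo
{
    public static void CreateCustomerWithAddress(ICustomerService proxy)
    {
        using (var scope = new TransactionScope())
        {
            proxy.CreateCustomer("Alice");
            proxy.CreateAddress("Alice", "1 Main St", "Springfield");
            scope.Complete();   // if this line is never reached, both operations roll back
        }
    }
}
```

This also illustrates the downside mentioned above: the client now controls how long the transaction, and therefore the locks behind it, stays open.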
Web services, in general, are a lot more powerful and a lot more thought out than most people doing SOA ever realize. Most of the problems they see were solved a long time ago. But then, SOA is just a buzzword for multi-tiered architecture, and most people who think it is the greatest thing since sliced bread don't even know what was around 10 years ago.
As your customer, I would be a lot more careful about the performance side. Fine-grained, non-semantic web services like the ones he defines are a performance hog for non-casual use, because the number of times you cross the network to query/update small pieces of data means network latency kills you. Creating an order for, say, 10 goods can easily take 30-40 network calls in this scenario, which can take a lot of time. SOA has preached, ever since the beginning (if you ignore the ramblings of those who don't know history), NOT to use fine-grained calls but to go for a coarse-grained exchange of documents and/or a semantic approach, much like the OData system.
If transactionality is required, a coarser-grained single operation that can implement transaction-semantics on the server is definitely going to be much simpler to implement.
That said, certainly it is possible to construct some scheme where the target of the operations is not committed until all of the necessary fine-grained operations have succeeded. For instance, have a Commit operation that checks some flag associated with the object on the server; the flag is not set until all of the necessary steps in the transaction have completed, and Commit fails if the flag is not set.
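One possible shape of that flag-based scheme, as a hedged sketch with invented names; note that it deliberately keeps the pending state in memory, which is exactly the statefulness trade-off discussed in the other answer:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pending-customer record built up by fine-grained calls.
public class PendingCustomer
{
    public string Name { get; set; }
    public string Address { get; set; }
    public bool ReadyToCommit
    {
        get { return !string.IsNullOrEmpty(Name) && !string.IsNullOrEmpty(Address); }
    }
}

public class CustomerStagingService
{
    private readonly Dictionary<Guid, PendingCustomer> _pending = new Dictionary<Guid, PendingCustomer>();

    public Guid BeginCustomer()
    {
        var id = Guid.NewGuid();
        _pending[id] = new PendingCustomer();
        return id;
    }

    public void SetName(Guid id, string name)       { _pending[id].Name = name; }
    public void SetAddress(Guid id, string address) { _pending[id].Address = address; }

    // Commit only succeeds once every required step has been completed.
    public void Commit(Guid id, Action<PendingCustomer> persist)
    {
        var candidate = _pending[id];
        if (!candidate.ReadyToCommit)
            throw new InvalidOperationException("Customer is missing required data; nothing was persisted.");

        persist(candidate);   // the single real write, now that the object is complete
        _pending.Remove(id);
    }
}
```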
Of course, if having lightweight, fine-grained operations is an important design requirement, perhaps the need for transactionality should be rethought.
When starting a new ASP.NET application, with the knowledge that at some point in the future it must scale, what are the most important design decisions that will allow future scalability without wholesale refactoring?
My top three decisions are:
Disabling session state, or storing it in a database.
Storing as little as possible in session state.
Good N-tier architecture. Separating business logic and using web services instead of directly accessing DLLs ensures that you can scale out both the business layer and the presentation layer. Your database will likely be able to handle anything you throw at it, although you can probably cluster that too if needed.
You could also look at partitioning data in the database too.
I have to admit though I do this regardless of whether the site has to scale or not.
These are our internal ASP.NET do's and don'ts for massively visited web applications:
General Guidelines
Don't use Sessions - SessionState=Off
Disable ViewState completely - EnableViewState=False
Don't use any of the complex ASP.NET UI controls; stick to basic ones (e.g. a simple Repeater instead of a DataGrid)
Use the fastest and shortest data access mechanisms (stick to SqlDataReaders on the front site)
Application Architecture
Create a caching manager with an abstraction layer. This will allow you to replace the simple System.Web.Cache with a more complex distributed caching solution in the future when you start scaling your application (see the sketch after this list).
Create a dedicated I/O manager with an abstraction layer to support future growth (S3 anyone?)
Build timing tracing into your main pipelines, which you can switch on and off; this will allow you to detect bottlenecks when they occur.
Employ a background processing mechanism and hand off to it whatever is not required to render the current page.
Better yet - consider firing events from your application to other applications so they can do that async work.
Prepare for database scalability: add your own layer so that you can later decide whether to partition your database or, alternatively, work with several read servers in a master-slave scenario.
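As referenced from the caching point above, here is a sketch of what such an abstraction layer could look like; the interface and class names are ours, and the first implementation simply wraps the built-in ASP.NET cache so that a distributed cache can later be dropped in behind the same interface:

```csharp
using System;
using System.Web;
using System.Web.Caching;

// The application codes against this interface only.
public interface ICacheManager
{
    T Get<T>(string key) where T : class;
    void Set<T>(string key, T value, TimeSpan timeToLive) where T : class;
    void Remove(string key);
}

// First implementation: a thin wrapper over the in-process ASP.NET cache.
// Later, a Redis/AppFabric/memcached-backed implementation can replace it
// without touching any calling code.
public class AspNetCacheManager : ICacheManager
{
    public T Get<T>(string key) where T : class
    {
        return HttpRuntime.Cache.Get(key) as T;
    }

    public void Set<T>(string key, T value, TimeSpan timeToLive) where T : class
    {
        HttpRuntime.Cache.Insert(
            key, value, null,
            DateTime.UtcNow.Add(timeToLive),
            Cache.NoSlidingExpiration);
    }

    public void Remove(string key)
    {
        HttpRuntime.Cache.Remove(key);
    }
}
```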
Above all, learn from others' successes and failures, and stay positive.
Ensure you have a solid caching policy for transient / static data. Database calls are expensive especially with separate physical servers so be aggressive with your caching.
There are so many considerations, that one could write a book on the subject. In fact, there is a great book and it is free. ;-)
Microsoft has released Improving .NET Application Performance and Scalability as a PDF eBook.
It is worth reading cover to cover, if you don't mind the droll writing style. Not only does it identify key performance scenarios, it also covers establishing benchmarks, measuring performance, and applying what you learn.