Can I rely on Riak as a master datastore in e-commerce?

In the Riak documentation there are often examples suggesting you could model your e-commerce datastore in a certain way. But it also says:
In a production Riak cluster being hit by lots and lots of concurrent writes, value conflicts are inevitable, and Riak Data Types are not perfect, particularly in that they do not guarantee strong consistency and in that you cannot specify the rules yourself.
From http://docs.basho.com/riak/latest/theory/concepts/crdts/#Riak-Data-Types-Under-the-Hood, last paragraph.
So, is it safe enough to use Riak as the primary datastore in an e-commerce app, or is it better to use another database with stronger consistency?

Riak out of the box
In my opinion, out of the box Riak is not safe enough to use as the primary datastore in an e-commerce app. This is because of the eventually consistent nature of Riak (and of many NoSQL solutions).
According to the CAP theorem, distributed datastores (Riak being one of them) can guarantee at most two of:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
Riak specifically errs on the side of Availability and Partition tolerance, offering eventual consistency for the data held in its datastore.
What Riak can do for an e-commerce app
Using Riak out of the box, it would be a good store for the content about the items being sold in your e-commerce app (content that is generally written once and read lots is a great use case for Riak). However, values such as:
the count of how many items are left
the money in a user's account
need to be handled carefully in a distributed datastore.
Implementing consistency in an eventually consistent datastore
There are several methods you can use, they include:
Implement a serialization method when writing updates to values that need to be consistent (i.e. go through a single, controlled service that guarantees it will only update a single item sequentially); this would need to be done outside of Riak, in your API layer
Change the replication properties of your consistent buckets so that you can 'guarantee' you never retrieve out-of-date data (see the sketch below)
At the bucket level, you can choose how many copies of data you want to store in your cluster (N, or n_val), how many copies you wish to read from at one time (R, or r), and how many copies must be written to be considered a success (W, or w).
The above method is similar to using the strong consistency model available in the latest versions of Riak.
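As an illustration of that second approach, here is a minimal sketch against Riak's HTTP interface. The bucket name, key, and stock field are made up for illustration, and the default HTTP port 8098 is assumed; check the exact endpoints and parameters against your Riak version.

```typescript
// Sketch: full-quorum reads and writes over Riak's HTTP API.
const base = 'http://localhost:8098'; // assumed default HTTP port

async function decrementStock(sku: string): Promise<void> {
  // Keep three replicas of every value in the bucket.
  await fetch(`${base}/buckets/inventory/props`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ props: { n_val: 3 } }),
  });

  // Read from all three replicas (r=3) so we do not act on a stale value.
  const resp = await fetch(`${base}/buckets/inventory/keys/${sku}?r=3`);
  const item = await resp.json();
  const vclock = resp.headers.get('X-Riak-Vclock');

  // Write back only once all three replicas acknowledge (w=3, dw=3),
  // passing the vector clock back so Riak can order the update.
  await fetch(`${base}/buckets/inventory/keys/${sku}?w=3&dw=3`, {
    method: 'PUT',
    headers: {
      'Content-Type': 'application/json',
      ...(vclock ? { 'X-Riak-Vclock': vclock } : {}),
    },
    body: JSON.stringify({ ...item, stock: item.stock - 1 }),
  });
}
```

Even with r and w set to n_val, the read-modify-write cycle itself is still not atomic, which is exactly the caveat raised in the important note below.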
Important note: in all of these datastore systems (distributed or not) you will generally:
Read the current data
Make a decision based on the current value
Change the data (decrement the Item count)
If these three actions cannot be done atomically (either by locking, or by failing the third step if the value was changed by something else in the meantime), an e-commerce app is open to abuse. This issue exists in traditional SQL storage solutions as well (which is why you have SQL transactions).
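To make that concrete, here is a hedged sketch (using the mssql Node.js driver and a made-up Items table) of collapsing the three steps into one conditional statement, so the check and the change cannot be separated by another writer:

```typescript
import * as sql from 'mssql';

// The stock check and the decrement happen in a single atomic UPDATE,
// so two concurrent buyers cannot both take the last item.
async function reserveItem(pool: sql.ConnectionPool, itemId: number): Promise<boolean> {
  const result = await pool.request()
    .input('id', sql.Int, itemId)
    .query('UPDATE Items SET Stock = Stock - 1 WHERE ItemId = @id AND Stock > 0');

  // Zero rows affected means the item was sold out (or we lost the race).
  return result.rowsAffected[0] === 1;
}
```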

Related

How does BaseX handle concurrency?

I'm looking at using BaseX as a more flexible database.
How does it handle database concurrency? How does it work in a web app scenario, where two different users could update the same data and effectively get a "dirty read"?
How does it work in a web app scenario, where two different users could update the same data and effectively get a "dirty read"?
Be sure: transactions are isolated from each other, so that update anomalies cannot occur.
How does it handle database concurrency?
Have a look at the BaseX wiki page about transaction management, where the approach is described in detail. Disclaimer: I implemented the newer database locking for BaseX during my thesis work, so I'm involved in the project.
BaseX applies several mechanisms to prevent colliding transactions. The old process locking (which can still be enabled using the GLOBALLOCK option) simply forbids running multiple queries within a process; parallel execution could be achieved across multiple database instances, while basic isolation was achieved through per-database file system locks (without any guarantees regarding deadlocks, etc.).
The newer database locking isolates parallel transactions by applying two-phase locking at the database level. Thus, two queries accessing multiple databases will run in parallel as long as they access different databases; otherwise one of them has to wait (they certainly do not run at the same time). A drawback is that, since we want to support deadlock-free execution, we went for strict two-phase locking, which fetches all database locks before executing the query. This comes with a penalty, because determining which databases will be accessed is rather difficult in a dynamic language like XQuery, and it often falls back to global locks on all databases.
For the future (as time allows; no schedule is set), some optimizations are queued, especially relaxing the strictness of two-phase locking and the optimistic concurrency control I evaluated in my thesis, which would bring large gains in parallel execution, especially for web application scenarios.

What documentation exists for DynamoDB's consistency model, CAP, partition recovery etc?

I'm considering using Amazon's DynamoDB. Naturally, if you're bothering to use a highly available distributed data store, you want to make sure your client deals with outages in a sensible way!
While I can find documentation describing Amazon's "Dynamo" database, it's my understanding that "DynamoDB" derives its name from Dynamo, but is not at all related in any other way.
For DynamoDB itself, the only documentation I can find is a brief forum post which basically says "retry 500 errors". For most other databases much more detailed information is available.
Where should I be looking to learn more about DynamoDB's outage handling?
While Amazon DynamoDB indeed lacks a detailed statement about its choices regarding the CAP theorem (I'm still hoping for a DynamoDB edition of Kyle Kingsbury's most excellent Jepsen series - Call me maybe: Cassandra analyzes a Dynamo-inspired database), Jeff Walker Code Ranger's answer to DynamoDB: Conditional writes vs. the CAP theorem confirms the lack of clear information in this area, but asserts that we can make some pretty strong inferences.
In fact, the referenced forum post also suggests a strong emphasis on availability:
DynamoDB does indeed synchronously replicate across multiple availability zones within the region, and is therefore tolerant to a zone failure. If a zone becomes unavailable, you will still be able to use DynamoDB and the service will persist any successful writes that we have acknowledged (including writes we acknowledged at the time that the availability zone became unavailable).
The customer experience when a complete availability zone is lost ranges from no impact at all to delayed processing times in cases where failure detection and service-side redirection are necessary. The exact effects in the latter case depend on whether the customer uses the service's API directly or connects through one of our SDKs.
Beyond that, Werner Vogels' posts on Dynamo/DynamoDB provide further insight:
Amazon's Dynamo - about the original paper
Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications - main introductory article including:
History of NoSQL at Amazon – Dynamo
Lessons learned from Amazon's Dynamo
Introducing DynamoDB - this features the most relevant information regarding the subject matter
Durable and Highly Available. Amazon DynamoDB replicates its data over at least 3 different data centers so that the system can continue to operate and serve data even under complex failure scenarios.
Flexible. Amazon DynamoDB is an extremely flexible system that does not force its users into a particular data model or a particular consistency model. DynamoDB tables do not have a fixed schema but instead allow each data item to have any number of attributes, including multi-valued attributes. Developers can optionally use stronger consistency models when accessing the database, trading off some performance and availability for a simpler model. They can also take advantage of the atomic increment/decrement functionality of DynamoDB for counters. [emphasis mine]
DynamoDB One Year Later: Bigger, Better, and 85% Cheaper… - about improvements
Finally, Aditya Dasgupta's presentation about Amazon's DynamoDB also analyzes its modus operandi regarding the CAP theorem.
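Regarding the two emphasized capabilities above, a minimal hedged sketch with the AWS SDK for JavaScript (v2) DocumentClient might look like this (the table, key, and counter attribute are made up for illustration):

```typescript
import { DynamoDB } from 'aws-sdk';

const doc = new DynamoDB.DocumentClient({ region: 'us-east-1' });

async function readAndCount(orderId: string): Promise<void> {
  // Strongly consistent read: trades some latency/availability
  // for the "simpler model" mentioned above.
  const res = await doc.get({
    TableName: 'Orders',
    Key: { orderId },
    ConsistentRead: true,
  }).promise();
  console.log(res.Item);

  // Atomic increment: ADD is applied server-side,
  // so no read-modify-write cycle is needed for counters.
  await doc.update({
    TableName: 'Orders',
    Key: { orderId },
    UpdateExpression: 'ADD viewCount :inc',
    ExpressionAttributeValues: { ':inc': 1 },
  }).promise();
}
```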
Practical Guidance
In terms of practical guidance for retry handling, the DynamoDB team has meanwhile added a dedicated section about Handling Errors, including Error Retries and Exponential Backoff.
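With the AWS SDK for JavaScript (v2), that guidance largely amounts to configuring the built-in retry behavior; a small sketch (the numbers are illustrative, not recommendations):

```typescript
import { DynamoDB } from 'aws-sdk';

// The SDK already retries throttled requests and 5xx responses; these
// options raise the attempt count and set the exponential backoff base.
const doc = new DynamoDB.DocumentClient({
  maxRetries: 10,
  retryDelayOptions: { base: 50 }, // base delay in ms, grows exponentially per retry
});
```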

How efficient is Meteor's DDP at syncing very large collections?

Meteor's DDP protocol works very well for syncing a small collection of data from a server to a browser-based client, which inherently limits the amount of data that is processed.
However, consider a situation where Meteor is being used to sync a large collection from one server to another, or just the DDP protocol itself is used to sync one MongoDB with another.
How efficient is DDP in this case (computationally)? How well does it scale to several clients? Is the limit to performance only bandwidth or will DDP hit some CPU bound as well? What is the largest amount of data that can be reasonably synced over DDP right now? Is DDP just the wrong approach for doing this (see references below)?
Some additional thoughts:
As far as I know, the current version of DDP keeps track of each client's entire collection, so it can't be asymptotically very efficient.
Smart Collections were created to improve the performance of server-to-client collection syncing. But it's unclear to me whether this improves DDP itself or something else.
See also:
How to implement real-time replication of MongoDB (or CouchDB) to many remote clients
DDP vs Straight MongoDB access for synching large amounts of data
EDIT:
After some empirical experience with this, I have to conclude that the answer is "not very efficient". See https://stackoverflow.com/a/21835534/586086 for an explanation.
Discussions with Meteor devs indicated that this problem will be addressed in the future with a revision of DDP and the publish-subscribe API, whereby the merge box will be removed and clients will handle merging. This will save CPU/memory on the server and allow for much larger datasets to be sent over the wire.
Basically it is more a matter of what and how you are publishing to the client than of the number of clients. A request is usually handled in O(log2 N) time if indexed, so it is quite easy for the server to recompute the result set even if (in the worst case) the whole collection changed. So, from the server side you can quite quickly get the new result sets to publish to the clients (if they changed from the ones they already had).
The real problem (and a common error) comes when you publish everything to the client (as with the former autopublish behavior), so build your publications wisely so that you only send what the client is supposed to see. You can either prune the documents by hiding unneeded fields, or reduce the result set sent to the client by creating a publication whose parameters are specific to your data's scope of use, as sketched below.
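For example, a hedged sketch of such a scoped publication (the collection, field names, and parameter are made up for illustration):

```typescript
import { Meteor } from 'meteor/meteor';
import { Listings } from '/imports/api/listings'; // hypothetical collection

Meteor.publish('listings.search', function (city: string) {
  // Send only the fields the UI actually renders, scoped to the user's query,
  // instead of publishing whole documents (or, worse, the whole collection).
  return Listings.find(
    { city, published: true },
    { fields: { title: 1, price: 1, photoUrl: 1 }, limit: 50 }
  );
});
```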
Data reactivity (a session parameter bound to a publication) should also be handled with care: if, for example, you send a request each time a key is pressed in a search field, you can quickly overload the connection (this still depends strongly on the size of the set you are publishing). We had to take care of this while building a real estate service on Meteor; with a data set of several gigabytes it was quite challenging to handle without blocking the pipe with excess data.
In terms of bandwidth, DDP is quite good because it supports clever entry updates (sending only field changes instead of the whole document); moving an item is (or will be) supported too (server-side reordering).
Also take a look at this excellent answer concerning huge collections and what is done under the hood.

Meteor server-side memory usage for thousands of concurrent users

Based on this answer, it looks like the meteor server keeps an in-memory copy of the cache for each connected client. My understanding is that it gets used in order to avoid sending multiple copies of data when dealing with overlapping subscriptions on a client.
The relevant part of the linked answer (emphasis is mine):
The merge box: The job of the merge box is to combine the results (added, changed and removed calls) of all of a client's active publish functions into a single data stream. There is one merge box for each connected client. It holds a complete copy of the client's minimongo cache.
Assuming that answer is still accurate in the current version of meteor, couldn't that create a huge waste of memory on the server as the number of users increases?
As an off-the-cuff calculation, if an app had about a 100kB cache per client, then 10,000 concurrent users would use up 1GB of memory on the server, and 100,000 users a whopping 10GB! This would be true even if each client was looking at almost identical data. It seems plausible for an app to use much more data than that per client, which would further exacerbate the problem.
Does this problem exist in the current version of Meteor? If so, what techniques can be used to limit the amount of memory the server needs to use to manage all the client subscriptions?
Take a look at this post by Arunoda at his meteorhacks.com blog:
http://meteorhacks.com/making-meteor-500-faster-with-smart-collections.html
which talks about his Smart Collections page:
http://meteorhacks.com/introducing-smart-collections.html
He created an alternative collection stack which has succeeded in its goals of speed, efficiency (memory & CPU) and scalability (you can see a graphed comparison in the post). Admittedly, in his tests RAM usage was negligible with both collection types, although the way he's implemented things there should be a very obvious difference with the type of use case you mentioned.
Also, you can see in this post on meteor-core:
https://groups.google.com/d/msg/meteor-core/jG1KLObX1bM/39aP4kxqWZUJ
that the Meteor developers are aware of his work and are cooperating in implementing some of the improvements into Meteor itself (but until then his smart package works great).
Important note! Smart Collections relies on access to the MongoDB oplog. This is easy if you're running on your own machine or hosted infrastructure. If you're using a cloud-based database, this option might not be available, or if it is, it will cost a lot more than the smaller packages.

Live Data Web Application Design

I'm about to begin designing the architecture of a personal project that has the following characteristics:
Essentially a "game" based on a sport, with several concurrent users.
Matches in this sport are simulated on a regular basis and their results stored in a database.
Users can view the details of a simulated match "live" when it is occurring as well as see results after they have occurred.
I developed a similar web application with a much smaller scope as the previous iteration of this project. In that case, however, I chose to go with SQLite as my DB provider since I also had a redistributable desktop application that could be used to manually simulate matches (and in fact that ran as a standalone simulator outside of the web application). My constraints have now shifted to be only a web application, so I don't have to worry about this additional level of complexity.
My main problem with my previous implementation was handling concurrent requests. I made the mistake of using one database (which was represented by a single file on disk) to power both the simulation aspect (which ran in a separate process on the server) and the web application. Hence, when users were accessing the website concurrently with a live simulation happening, there were all sorts of database access issues since it was getting locked by one process. I fixed this by implementing a cross-process mutex on database operations but this drastically slowed down the performance of the website.
The tools I will be using are:
ASP.NET for the web application.
SQL Server 2008 R2 for the database... probably with an NHibernate layer for object relational mapping.
My question is: how do I design this so I achieve optimal efficiency as well as concurrent access? Obviously, shifting from a file to an actual DB server will have its positives, but do I need two separate servers, one for the simulation process and one for the web server process?
Any suggestions would be appreciated!
Thanks.
You should be fine doing both on the same database. Concurrent access is what modern database engines are designed for. Concurrent reads are usually no problem at all; concurrent writes lock the minimum possible amount of data (a table, or even just a number of rows), not the entire database.
A few things you should keep in mind though:
Use transactions wisely. On the one hand, a transaction is an important tool in making sure your database is always consistent - in short, a transaction either happens completely, or not at all. On the other hand, two concurrent transactions can cause deadlocks, and those buggers can be extremely hard to debug (a short sketch follows this list).
Normalize, and use constraints to protect your data integrity. Enforcing foreign keys can save the day, even though it often leads to more cumbersome administration.
Minimize the amount of time spent on data access: don't keep connections around when you don't need them, make absolutely sure you're not leaking any connections, don't fetch data you know you don't need, and do as much data-related processing as possible (especially things that can be solved using joins, subqueries, groupings, views, etc.) in SQL instead of in code.
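Tying the transaction and connection-handling points together, here is a hedged sketch for the simulation process using the mssql driver (table and column names are made up); the transaction is kept short and the pool handles connection reuse:

```typescript
import * as sql from 'mssql';

async function recordMatchResult(
  pool: sql.ConnectionPool,
  matchId: number,
  home: number,
  away: number
): Promise<void> {
  const tx = new sql.Transaction(pool);
  await tx.begin();
  try {
    // Store the final score.
    await new sql.Request(tx)
      .input('id', sql.Int, matchId)
      .input('home', sql.Int, home)
      .input('away', sql.Int, away)
      .query('UPDATE Matches SET HomeScore = @home, AwayScore = @away, Finished = 1 WHERE MatchId = @id');

    // Record the "finished" event the web app polls or displays live.
    await new sql.Request(tx)
      .input('id', sql.Int, matchId)
      .query("INSERT INTO MatchEvents (MatchId, EventType) VALUES (@id, 'finished')");

    await tx.commit(); // both writes become visible together...
  } catch (err) {
    await tx.rollback(); // ...or neither does
    throw err;
  }
}
```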
