High write concurrency backend for storing large set/array based data? - sqlite

The problem:
I have a web service that needs to check membership of a given string against a set of strings, where the number of elements in the set is under constant growth, potentially numbering in the hundreds of millions.
If the string is not a member of the set, it gets added to the set. The string size will be a constant 32 bytes. Only one set variable is required, no other variables need to be persisted.
This check is performed as part of a callback on a webhook, thus performance is critical.
While my use case pretty much fits a bloom filter perfectly, I'm having trouble finding a solution to deal with the persistent storage vs i/o concurrency portion of the problem.
Environment:
DigitalOcean/Linux/Python/Flask, but open to change if required
Possible Solutions:
Redis, storing the values in a set and then querying via SISMEMBER for a nice O(1) solution. This is what we are currently using, but it doesn't scale well with a large number of keys, given that everything must fit in memory, and it also has issues with write concurrency when traffic increases.
SQLite, with WAL mode turned on. Concerned about lock contention (SQLITE_BUSY) when the server gets hit with a significant number of webhook requests, and a local database file doesn't scale across host machines.
Postgres, which seems like a nice middle-ground solution, but might also have lock contention to deal with under concurrent writes.
Cassandra, given its focus on write performance. Overkill for storing a single column, though?
A custom bloom filter backend; not sure if something like this exists that provides the functionality of a bloom filter with a high-I/O-concurrency storage backend.
Thoughts?

The Redis solution can scale well with data sharding. You can set up several Redis instances (or use Redis Cluster), split your data into several parts, i.e. shards, and save each part in a different Redis instance.
When you want to check the membership of a given string, you send the SISMEMBER command to the corresponding Redis instance. Take this answer as an example of how to split data with hash functions.
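For illustration, a minimal redis-py sketch of client-side sharding (the hostnames and the set key "seen" are placeholders; a Redis Cluster client would do the routing for you):

```python
import hashlib

import redis  # redis-py

# Hypothetical shard endpoints; replace with your own instances or a cluster client.
SHARDS = [
    redis.Redis(host="10.0.0.1", port=6379),
    redis.Redis(host="10.0.0.2", port=6379),
    redis.Redis(host="10.0.0.3", port=6379),
]

def shard_for(member: str) -> redis.Redis:
    """Route a member to a shard by hashing it."""
    digest = hashlib.md5(member.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

def seen_before(member: str) -> bool:
    """Check membership only on the shard that owns this member."""
    return bool(shard_for(member).sismember("seen", member))
```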
Also, you can implement a bloom filter with Redis (GETBIT and SETBIT). Just a reminder: bloom filters can return false positives.
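A rough redis-py sketch of that idea (the key name, bit-array size, and number of hash functions below are arbitrary placeholders, not tuned values):

```python
import hashlib

import redis

r = redis.Redis()          # assumes a reachable Redis instance
BLOOM_KEY = "seen:bloom"   # hypothetical key name
NUM_BITS = 2 ** 30         # size the bit array for your element count / error rate
NUM_HASHES = 7             # number of hash functions

def _bit_offsets(member: str):
    """Derive NUM_HASHES bit positions from the member string."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{member}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS

def bloom_add(member: str) -> None:
    pipe = r.pipeline()
    for offset in _bit_offsets(member):
        pipe.setbit(BLOOM_KEY, offset, 1)
    pipe.execute()

def bloom_might_contain(member: str) -> bool:
    pipe = r.pipeline()
    for offset in _bit_offsets(member):
        pipe.getbit(BLOOM_KEY, offset)
    # False positives are possible, false negatives are not.
    return all(pipe.execute())
```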

First, you don't need to use SISMEMBER. Just do SADD systematically and test the returned value: if it is 0, the value was already in the set and so was not added. Doing so you will easily reduce the number of requests to Redis.
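A minimal redis-py sketch of that pattern (the set key "seen" is a placeholder):

```python
import redis

r = redis.Redis()  # assumes a reachable Redis instance

def check_and_add(member: str) -> bool:
    """Single round trip: returns True if the member was already present."""
    added = r.sadd("seen", member)  # SADD returns 1 if newly added, 0 if already there
    return added == 0
```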
Second, the description of your problem looks like a perfect match for HBase, which is made for storing very large data sets and querying them using bloom filters. But you'll probably find it's overkill, just like Cassandra.

Related

DynamoDB: Conditional writes vs. the CAP theorem

Using DynamoDB, suppose two independent clients try to write to the same item at the same time, using conditional writes, and both try to change the value that the condition references. Obviously, one of these writes is doomed to fail the condition check; that's OK.
Suppose during the write operation, something bad happens, and some of the various DynamoDB nodes fail or lose connectivity to each other. What happens to my write operations?
Will they both block or fail (sacrifice of "A" in the CAP theorem)? Will they both appear to succeed, only for it to turn out later that one of them was actually ignored (sacrifice of "C")? Or will they somehow both work correctly due to some magic (consistent hashing?) going on in the DynamoDB system?
It just seems like a really hard problem, but I can't find anything discussing the possibility of availability issues with conditional writes (unlike with, for instance, consistent reads, where the possibility of availability reduction is explicit).
There is a lack of clear information in this area, but we can make some pretty strong inferences. Many people assume that DynamoDB implements all of the ideas from its predecessor "Dynamo", but that doesn't seem to be the case, and it is important to keep the two separate in your mind. The original Dynamo system was carefully described by Amazon in the Dynamo Paper. In thinking about this, it also helps to be familiar with the distributed databases based on the Dynamo ideas, like Riak and Cassandra; in particular Apache Cassandra, which provides a full range of trade-offs with respect to CAP.
By comparing DynamoDB (which is clearly distributed) to the options available in Cassandra, I think we can see where it is placed in the CAP space. According to Amazon, "DynamoDB maintains multiple copies of each item to ensure durability. When you receive an 'operation successful' response to your write request, DynamoDB ensures that the write is durable on multiple servers. However, it takes time for the update to propagate to all copies." (Data Read and Consistency Considerations). Also, DynamoDB does not require the application to do conflict resolution the way Dynamo does. Assuming they want to provide as much availability as possible, and since they say they are writing to multiple servers, writes in DynamoDB are equivalent to Cassandra's QUORUM level. Also, it would seem DynamoDB does not support hinted handoff, because that can lead to situations requiring conflict resolution. For maximum availability, an inconsistent read would only have to be at the equivalent of Cassandra's ONE level. However, to get a consistent read given the quorum writes would require a QUORUM-level read (following R + W > N for consistency; for example, with N = 3 replicas, W = 2 and R = 2 guarantee that the read quorum overlaps the write quorum). For more information on levels in Cassandra see About Data Consistency in Cassandra.
In summary, I conclude that:
Writes are "Quorum", so a majority of the nodes the row is replicated to must be available for the write to succeed
Inconsistent Reads are "One", so only a single node with the row need be available, but the data returned may be out of date
Consistent Reads are "Quorum", so a majority of the nodes the row is replicated to must be available for the read to succeed
So writes have the same availability as a consistent read.
To specifically address your question about two simultaneous conditional writes: one or both will fail, depending on how many nodes are down. However, there will never be an inconsistency. I think the availability of the writes really has nothing to do with whether they are conditional or not.
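For reference, a conditional write with boto3 looks roughly like this (table, key, and attribute names are hypothetical); a lost race surfaces as ConditionalCheckFailedException, while an availability problem surfaces as a different error:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Counters")  # hypothetical table name

def conditional_update(item_id: str, expected: int, new_value: int) -> bool:
    """Attempt a conditional write; return False if the condition check fails."""
    try:
        table.update_item(
            Key={"id": item_id},  # assumes a partition key named "id"
            UpdateExpression="SET #v = :new",
            ConditionExpression="#v = :expected",
            ExpressionAttributeNames={"#v": "value"},
            ExpressionAttributeValues={":new": new_value, ":expected": expected},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # the other client won the race; this is expected
        raise  # anything else (capacity, availability, ...) propagates
```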

Caching result of SELECT statement for reuse in multiple queries

I have a reasonably complex query to extract the Id field of the results I am interested in based on parameters entered by the user.
After extracting the relevant Ids I am using the resulting set of Ids several times, in separate queries, to extract the actual output record sets I want (by joining to other tables, using aggregate functions, etc).
I would like to avoid running the initial query separately for every set of results I want to return. I imagine my situation is a common pattern so I am interested in what the best approach is.
The database is in MS SQL Server and I am using .NET 3.5.
It would definitely help if the question contained some measurements of the unoptimized solution (data sizes, timings). There is a variety of techniques that could be considered here, some listed in the other answers. I will assume that the reason why you do not want to run the same query repeatedly is performance.
If all the uses of the set of cached IDs consist of joins of the whole set to additional tables, the solution should definitely not involve caching the set of IDs outside of the database. Data should not travel there and back again if you can avoid it.
In some cases (when cursors or extremely complex SQL are not involved) it may be best (even if counterintuitive) to perform no caching and simply join the repetitive SQL to all desired queries. After all, each query still has to be driven from one of the joined tables, and performance then depends to a large degree on the availability of the indexes needed to join and evaluate all of the remaining information quickly.
The most intuitive approach to "caching" the set of IDs within the database is a temporary table (if named #something, it is private to the connection and therefore usable by parallel independent clients; or it can be named ##something and be global). If the table is going to have many records, indexes are necessary. For optimum performance, the index should be a clustered index (only one is allowed per table), or at least be created only after the set has been populated, which makes index creation slightly faster.
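For illustration, the temporary-table pattern looks roughly like this (hypothetical schema and connection string, driven from Python/pyodbc for brevity; the same T-SQL applies from ADO.NET):

```python
import pyodbc

# Hypothetical connection string and schema; adapt to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDb;Trusted_Connection=yes"
)
cur = conn.cursor()
user_filter = "example"  # stands in for the user-entered parameters

# 1. Create the connection-private temp table with a non-parameterized statement
#    (so it is not dropped with the prepared-statement scope some drivers use),
#    then fill it with the expensive ID query exactly once.
cur.execute("CREATE TABLE #MatchingIds (Id INT NOT NULL)")
cur.execute(
    """
    INSERT INTO #MatchingIds (Id)
    SELECT f.Id
    FROM dbo.SomeComplexSource f   -- stands in for the real, complex query
    WHERE f.SomeFilter = ?
    """,
    user_filter,
)

# 2. Index it after it has been populated (slightly faster than indexing first).
cur.execute("CREATE CLUSTERED INDEX IX_MatchingIds ON #MatchingIds (Id)")

# 3. Reuse the cached set of IDs in each of the output queries.
cur.execute(
    """
    SELECT d.ParentId, COUNT(*) AS DetailCount
    FROM #MatchingIds m
    JOIN dbo.Details d ON d.ParentId = m.Id
    GROUP BY d.ParentId
    """
)
rows = cur.fetchall()
```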
Indexed views are clearly preferable to temporary tables, except when the underlying data is read-only during the whole process, or when you can and want to ignore such updates to keep the whole set of reports consistent as far as the set goes. However, the ability of indexed views to always accurately project the underlying data comes at the cost of slowing down those updates.
One other answer to this question mentions stored procedures. This is largely a way of organizing your code. However, if you go this way, it is preferable to avoid using temporary tables, because such references to a temporary table prevent pre-compilation of the stored procedure; go for views or indexed views if you can.
Regardless of the approach you choose, do not guess at the performance characteristics and query optimizer behavior. Learn to display query execution plans (within SQL Server Management Studio) and make sure that you see index accesses as opposed to nested loops combining multiple large sets of data; only add indexes that demonstrably and drastically change the performance of your queries. A well chosen index can often change the performance of a query by a factor of 1000, so this is somewhat complex to learn but crucial for success.
And last but not least, make sure you use UPDATE STATISTICS when repopulating the database (and nightly in production), or your query optimizer will not be able to put the indexes you have created to their best uses.
If you are planning to cache the result set in your application code, then ASP.NET has a cache object, and a WinForms application can simply keep the object holding the data in memory and reuse it.
If you plan to do the same in SQL Server, you might consider using indexed views to find the Ids. The view will be materialized, and hence you can get the results faster. You might even consider using a staging table to hold the Ids temporarily.
With SQL Server 2008 you can pass table variables as parameters to SQL. Just cache the IDs and then pass them as a table variable to the queries that fetch the data. The only caveat of this approach is that you have to predefine the table type as a UDT.
http://msdn.microsoft.com/en-us/library/bb510489.aspx
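A rough sketch of the shape of that approach (hypothetical type, table, and column names; here the table variable is filled server-side in one batch, whereas passing it in from the client as a true table-valued parameter additionally needs TVP support in your data-access library):

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDb;Trusted_Connection=yes"
)
cur = conn.cursor()

# One-time setup: the user-defined table type the IDs travel in.
cur.execute("CREATE TYPE dbo.IdList AS TABLE (Id INT PRIMARY KEY)")
conn.commit()

# Fill a table variable of that type once, then join it to the detail query.
cur.execute(
    """
    SET NOCOUNT ON;
    DECLARE @ids dbo.IdList;

    INSERT INTO @ids (Id)
    SELECT f.Id FROM dbo.SomeComplexSource f WHERE f.SomeFilter = ?;

    SELECT d.ParentId, COUNT(*) AS DetailCount
    FROM @ids i
    JOIN dbo.Details d ON d.ParentId = i.Id
    GROUP BY d.ParentId;
    """,
    "example-filter",
)
rows = cur.fetchall()
```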
For SQL Server, Microsoft generally recommends using stored procedures whenever practical.
Here are a few of the advantages:
http://blog.sqlauthority.com/2007/04/13/sql-server-stored-procedures-advantages-and-best-advantage/
* Execution plan retention and reuse
* Query auto-parameterization
* Encapsulation of business rules and policies
* Application modularization
* Sharing of application logic between applications
* Access to database objects that is both secure and uniform
* Consistent, safe data modification
* Network bandwidth conservation
* Support for automatic execution at system start-up
* Enhanced hardware and software capabilities
* Improved security
* Reduced development cost and increased reliability
* Centralized security, administration, and maintenance for common routines
It's also worth noting that, unlike other RDBMS vendors (like Oracle, for example), MSSQL automatically caches all execution plans:
http://msdn.microsoft.com/en-us/library/ms973918.aspx
However, for the last couple of versions of SQL Server, execution plans are cached for all T-SQL batches, regardless of whether or not they are in a stored procedure.
The best approach depends on how often the Id changes, or how often you want to look it up again.
One technique is to simply store the result in the ASP.NET object cache, using the Cache object (also accessible from HttpRuntime.Cache). For example (from a page):
this.Cache["key"] = "value";
There are many possible variations on this theme.
You can use Memcached to cache values in memory.
As far as I can see, there are some .NET ports.
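The .NET ports aside, the pattern itself is tiny; a sketch with Python's pymemcache (server address, key, and expiry are arbitrary):

```python
import json

from pymemcache.client.base import Client  # pymemcache; server address is hypothetical

mc = Client(("localhost", 11211))

def get_cached_ids(cache_key, load_ids):
    """Return the ID list from memcached, loading and storing it on a miss."""
    raw = mc.get(cache_key)
    if raw is not None:
        return json.loads(raw)
    ids = load_ids()  # e.g. run the expensive SELECT exactly once
    mc.set(cache_key, json.dumps(ids), expire=300)
    return ids

# Usage with a stub loader standing in for the real query:
ids = get_cached_ids("ids:user-filter-123", lambda: [1, 2, 3])
```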
How frequently does the data you'll be querying change? To me, this sounds like a perfect scenario for data warehousing, where you flatten the data for quicker retrieval and create the tables exactly as your 'DTO' wants to see the data. This method is different from an indexed view in that it's simply a table with quick seek operations, and it can be improved further if you set up the indexes properly on the columns that you plan to query.
You can create a global temporary table on the fly, insert the records for the current request, and then access that table in your joins on subsequent requests, for reusability.

Oracle sequence cache aging too often

My ASP.NET application uses some sequences to generate table primary keys. The DB administrators have set the cache size to 20. The application is now under test, and a few records are added daily (say 4 for each user test session).
I've found that new test-session records always use new cache portions, as if the previous day's cached numbers had expired, losing tens of keys every day. I'd like to understand whether this is due to some mistake I might have made in my application (disposing of TableAdapters or whatever) or whether it is the usual behaviour. Are there programming best practices to take into account when handling Oracle sequences?
Since the application will not have to bear a heavy workload (say 20-40 new records a day), I was thinking it might make sense to set a smaller cache size, or none at all.
Does resizing the sequence cache imply a reset of the current value?
Thank you in advance for any hints.
The answer from Justin Cave in this thread might be interesting for you:
http://forums.oracle.com/forums/thread.jspa?threadID=640623
In a nutshell: if the sequence is not accessed frequently enough but you have a lot of "traffic" in the library cache, then the sequence might be aged out and removed from the cache. In that case the pre-allocated values are lost.
If that happens very frequently to you, it seems that your sequence is not used very often.
I guess that reducing the cache size (or completely disabling it) will not have a noticeable impact on performance in your case (especially taking your statement of 20-40 new records a day into account).
Oracle Sequences are not gap-free. Reducing the Cache size will reduce the gaps... but you will still have gaps.
The sequence is not associated with the table by the database, but by your code (via the NEXTVAL on the insert, whether through a trigger, plain SQL, or a package API). On that note, you may use the same sequence across multiple tables (it is not like SQL Server's IDENTITY, which is tied to a column/table);
thus changing the sequence will have no impact on the indexes.
You would just need to make sure that if you drop the sequence and recreate it, you 'reseed' it to one more than the current value (e.g. create sequence seqer start with 125 nocache;),
but:
If your application requires a gap-free set of numbers, then you cannot use Oracle sequences. You must serialize activities in the database using your own developed code.
But be forewarned: you may increase disk I/O and possibly transaction locking if you choose not to use sequences.
The sequence generator is useful in multiuser environments for generating unique numbers without the overhead of disk I/O or transaction locking.
To reiterate a_horse_with_no_name's comments: what is the issue with gaps in the IDs?
Edit
Also have a look at the caching logic you should use, described here:
http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/views002.htm#i1007824
If you are using the sequence for PKs and not to enforce some application logic then you shouldn't worry about gaps. However, if there is some application logic tied to sequential sequence values, you will have holes if you use sequence caching and do not have a busy system. Sequence cache values can be aged out of the library cache.
You say that your system is not very busy; in that case, alter your sequence to NOCACHE. You are in a position to take a negligible performance hit to fix a logic issue, so you might as well.
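Concretely, that is a one-line change; sketched here with python-oracledb (connection details and the sequence name are hypothetical, and the ALTER SEQUENCE statement itself is the same from any client):

```python
import oracledb  # python-oracledb; connection details are hypothetical

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()

# Turn off caching for a low-traffic sequence: values are no longer
# pre-allocated, so nothing is lost when the sequence ages out of the
# library cache. The sequence name is a placeholder.
cur.execute("ALTER SEQUENCE faq_id_seq NOCACHE")

# Subsequent inserts keep using NEXTVAL exactly as before.
cur.execute("SELECT faq_id_seq.NEXTVAL FROM dual")
print(cur.fetchone()[0])
```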
As people mentioned: Gaps shouldn't be a problem, so if you are requiring no gaps you are doing something wrong. (But I don't think this is what you want).
Reducing the cache should reduce the number of gaps, and it will decrease the performance of the sequence, especially with concurrent access to it (which shouldn't be a problem in your use case).
Changing the sequence using the alter sequence statement (http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_2011.htm) should not reset the current/next val of the sequence.

ASP.NET data caching design

I have a method in my BLL that interacts with the database and retrieves data based on the defined criteria.
The returned data is a collection of FAQ objects which is defined as follows:
FAQID,
FAQContent,
AnswerContent
I would like to cache the returned data to minimize the DB interaction.
Now, based on the user selected option, I have to return either of the below:
ShowAll: all data.
ShowAnsweredOnly: faqList.Where(AnswerContent != null)
ShowUnansweredOnly: faqList.Where(AnswerContent == null)
My Question:
Should I cache only the full data set returned from the DB (e.g. FAQ_ALL) and filter the other faqList modes from the cache (i.e. interact with the DB just once and filter the data from that single cache item for the other two modes)? Or should I have three cache items, FAQ_ALL, FAQ_ANSWERED and FAQ_UNANSWERED (i.e. interact with the database for each mode, three times), and return the corresponding cache item for each mode?
I'd be pleased if anyone could tell me about the pros/cons of each approach.
Food for thought.
How many records are you caching, how big are the tables?
How much mid-tier resources can be reserved for caching?
How many of each type of data exist?
How fast will filtering on the client side be?
How often does the data change?
how often is it changed by the same application instance?
how often is it changed by other applications or server side jobs?
What is your cache invalidation policy?
What happens if you return stale data?
Can you/Should you leverage active cache invalidation, like SqlDependency or LinqToCache?
If the dataset is large then filtering on the client side will be slow, and you'll need to cache two separate results (no need for a third if ALL is the union of the other two). If the data changes often then caching will frequently return stale items without proactive cache invalidation in place. Active cache invalidation is achievable in the mid-tier if you control all the update paths and there is only one mid-tier application instance, but it becomes really hard if either of those prerequisites is not satisfied.
It basically depends how volatile the data is, how much of it there is, and how often it's accessed.
For example, if the answered data didn't change much then you'd be safe caching that for a while; but if the unanswered data changed a lot (and more often) then your caching needs might be different. If this was the case it's unlikely that caching it as one dataset will be the best option.
It's not all bad though: if the discrepancy isn't too huge then you might be OK caching the lot.
The other point to think about is how the data is related. If the FAQ items toggle between answered and unanswered then it makes sense to cache the base data as one; otherwise the items would be split up where you really want them together.
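If they do belong together, the "cache once, filter per mode" option is only a few lines; here is a Python sketch with a stubbed-out data-access call (in ASP.NET the dictionary would be the Cache object and the expiry a cache policy):

```python
import time

_CACHE = {}            # stand-in for ASP.NET's HttpRuntime.Cache
_TTL_SECONDS = 300     # hypothetical expiry

def load_faqs_from_db():
    """Stub standing in for the single BLL/DAL call that hits the database."""
    return [
        {"FAQID": 1, "FAQContent": "Q1?", "AnswerContent": "A1"},
        {"FAQID": 2, "FAQContent": "Q2?", "AnswerContent": None},
    ]

def get_all_faqs():
    """Return the full FAQ list, hitting the database at most once per TTL."""
    entry = _CACHE.get("FAQ_ALL")
    if entry and time.time() - entry["at"] < _TTL_SECONDS:
        return entry["data"]
    data = load_faqs_from_db()
    _CACHE["FAQ_ALL"] = {"data": data, "at": time.time()}
    return data

def get_faqs(mode):
    """Derive the answered/unanswered views by filtering the single cached list."""
    faqs = get_all_faqs()
    if mode == "ShowAnsweredOnly":
        return [f for f in faqs if f["AnswerContent"] is not None]
    if mode == "ShowUnansweredOnly":
        return [f for f in faqs if f["AnswerContent"] is None]
    return faqs  # ShowAll
```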
Alternatively, work with the data in-memory and treat the database as an add-on...
What do I mean? Well, typically the user will hit "save" this will invoke code which saves to the DB; when the next user comes along they will invoke a call which gets the data out of the DB. In terms of design the DB is a first class citizen, everything has to go through it before anyone else gets a look in. The alternative is to base the design around data which is held in-memory (by the BLL) and then saved (perhaps asynchronously) to the DB. This removes the DB as a bottleneck but gives you a new set of problems - like what happens if the database connection goes down or the server dies with data only in-memory?
Pros and Cons
Getting all the data in one call might be faster (by making fewer calls).
Getting all the data at once if it's related makes sense.
Granularity: data that is related and has a similar "cachability" can be cached together, otherwise you might want to keep them in separate cache partitions.

Fast, scalable string lookup

I have a set of 5 million strings. These are currently stored in a single column MySQL table. My application has to perform lookups and check if a given string is in the set. This can of course be done with a HashSet (in Java). But instead of building a custom solution, I was wondering if there are any existing, widely used, proven solutions that do this? It seems like a common scenario. The solution should be scalable (the set might increase beyond 5 million), have failover (so probably distributed) and perform well under a huge number of requests. Any suggestions?
Update: My app can also query to check if a given set of strings is present in the global (the 5 million one) set.
You can try a Trie or a Patricia trie; the second is more memory efficient. Also, here you can find a comparison of two data structures (Trie, TreeSet), an in-memory database, and their performance.
Try memcached, a high-performance, distributed memory object caching system. You look up using key/value hashes. Facebook uses memcached, as do many other highly scalable sites. Need to store more strings? Just add more memcached instances to the cluster. Plus you can use it in a two-tier caching setup, where you first query memcached and, on a cache miss, query the full database.
Have you considered adding column indexing to your MySQL database? Hash, b-tree and r-tree are supported.
MySQL can also be replicated and clustered for high scalability.
While a Trie might be the best solution, binary search on a sorted list of strings should also perform well at run time.
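For example, a binary-search membership test (shown with Python's bisect; the equivalent exists in most standard libraries):

```python
import bisect

def contains(sorted_strings, value):
    """Binary-search membership test on a sorted list: O(log n) comparisons."""
    i = bisect.bisect_left(sorted_strings, value)
    return i < len(sorted_strings) and sorted_strings[i] == value

strings = sorted(["alpha", "bravo", "charlie"])
print(contains(strings, "bravo"))   # True
print(contains(strings, "delta"))   # False
```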
