Berkeley DB: btree prefix comparison for directory-like keys? - berkeley-db

I'm going to index a BDB with keys that look a lot like directory paths ('/foo/bar', '/foo/baz', etc., with the number of slash-separated levels generally < 10).
Does anybody have any experience with using a Btree prefix comparison routine[1] for this? Are the savings worthwhile? Any references to experience papers on this subject?
[1] http://www.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/am_conf/bt_prefix.html

You may want to post your question to the Berkeley DB forum on OTN. There is an active community of Support, Engineering, and BDB application developers who interact directly in that forum.
What I've heard from customers, and what we've seen from our own use of btree prefixing in the BDB XML product, is that it can significantly reduce the size of the internal btree nodes, which improves cache efficiency, reduces I/O, and speeds up individual key lookups. This is also stated in the documentation about the btree prefix function. The extent of the improvement depends on a) your data and b) your application's data access patterns. If the key values are mostly identical, you will save more space in your btree index. If your access patterns perform many key lookups, and the smaller btree reduces the number of I/Os you have to perform, performance will improve commensurately.
Please note that if you provide a btree prefix function you must also provide a compatible btree comparison function.
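To make that contract concrete, here is a minimal sketch in Python of the logic such a compare/prefix pair computes for slash-delimited byte-string keys. This is illustration only, not Berkeley DB API code; in the C API the real callbacks would be registered with DB->set_bt_compare() and DB->set_bt_prefix().

def bt_compare(key1: bytes, key2: bytes) -> int:
    # Plain lexicographic byte comparison, the same ordering BDB uses by default.
    return (key1 > key2) - (key1 < key2)

def bt_prefix(key1: bytes, key2: bytes) -> int:
    # Return how many leading bytes of key2 are enough to determine its ordering
    # relative to key1; BDB stores only that many bytes in internal btree pages.
    n = min(len(key1), len(key2))
    for i in range(n):
        if key1[i] != key2[i]:
            return i + 1
    # key1 is a prefix of key2 (or the keys are equal)
    return n + 1 if len(key2) > len(key1) else len(key2)

# '/foo/bar/one' and '/foo/baz/two' first differ at byte 7, so only 8 bytes of
# the second key need to be kept in an internal node instead of all 12.

The more your keys cluster under shared prefixes like '/foo/', the less of each key the internal pages have to carry.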
For BDB XML we saw a 20-30% reduction in btree size.
The lexicographic key comparison/prefix functions which are used by default in Berkeley DB may already provide the behavior that you want.
Good luck with your research.

Related

Scan Vs BatchGetItems in Dynamo-db

If I know the primary keys of the items, which approach is better:
Scan with FilterExpression using the IN operator
BatchGetItem with all keys in the request parameters
Please recommend a solution in terms of both latency and partition impact.
Probably neither. Of course it all depends on the key schema and the data in the table, but you probably want to create a Global Secondary Index for your most frequently used queries.
Having said that, performing scans is highly discouraged, especially when working with large volumes of data. So if you know the primary keys of the items you're interested in, go for BatchGetItem over doing a scan.
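For illustration, a minimal boto3 sketch of the BatchGetItem approach (the table name and key attribute are made up); it chunks the keys because BatchGetItem accepts at most 100 items per request.

import boto3

dynamodb = boto3.resource("dynamodb")

def get_items_by_id(table_name, ids, key_name="id"):
    items = []
    for i in range(0, len(ids), 100):  # BatchGetItem takes at most 100 keys per call
        chunk = ids[i:i + 100]
        response = dynamodb.batch_get_item(
            RequestItems={table_name: {"Keys": [{key_name: v} for v in chunk]}}
        )
        items.extend(response["Responses"][table_name])
        # Throttled keys come back in response["UnprocessedKeys"] and should be retried.
    return items

Each key is read directly from its partition, whereas a Scan with a FilterExpression still reads every item in the table and only filters afterwards.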

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. if you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference whether you create 1 overloaded GSI or 2 non-overloaded GSIs?
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea, in a nutshell, is to avoid the overhead of doing joins on the database layer, or of going back to the database to effectively do the join on the application layer. By having the data already sliced in the format your application requires, all you need is essentially one select * from table where x = y call which returns multiple entities in one go (in your example that could be Users and Documents). This makes it extremely efficient and scalable at the db level, but it also means you'll be less flexible, as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
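To make the "one call returns the pre-joined slice" point concrete, here is a hedged boto3 sketch. The table, index, and attribute names (AppTable, GSI1, GSI1PK, GSI1SK) and the key formats are illustrative single-table conventions, not anything DynamoDB defines for you.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")

# One Query on the overloaded GSI returns all published documents for a user,
# newest first, because the items were written with that access pattern in mind.
response = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("USER#123")
    & Key("GSI1SK").begins_with("PUBLISHED#"),
    ScanIndexForward=False,  # descending sort-key order, i.e. newest first
)
documents = response["Items"]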
I don't think it has any performance benefits, at least none that are called out anywhere, which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be that it comes down to cost of storage and provisioned throughput.
Apart from that, I'm not sure it matters much, given the new limit of 20 GSIs per table.

How to strike a performance balance with a DocumentDB collection for multiple tenants?

Say I have:
My data for all of my tenants stored in a DocumentDB collection (i.e. multiple tenants).
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But partitions are NOT per tenant; I use some other scheme.
Because of this, data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximize performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions; they come at quite a cost (basically, index and parsing costs are multiplied by the number of partitions, which defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there are still limits on each partition (10K RU and 10GB). I have written about it here: http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced, and one idea is to make the partition key the tenantID plus some suffix to split the large tenants up. We haven't had to do that yet though, so we haven't worked out all of the details to know whether that would actually be possible and performant, but we think it would work.
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
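If you are using the Python SDK rather than .NET, the equivalent opt-in looks roughly like this with the current azure-cosmos package (the account, database, container, and attribute names here are placeholders):

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycoll")

# The tenant filter does not include the partition key, so the query must fan
# out to every partition; enable_cross_partition_query opts in to that.
items = container.query_items(
    query="SELECT * FROM c WHERE c.tenantId = @tenant",
    parameters=[{"name": "@tenant", "value": "tenant-42"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"])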
The DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general: https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/

DocumentDB Index Performance / Fragmentation

I have decided to implement the following ID strategy for my documents, which combines the document "type" with the ID:
doc.id = "docType_" + Guid.NewGuid().ToString("n");
// create document in collection
This results in IDs such as the following for my documents:
usr_19d17037ea7f41a9b20db1a90f71d30d
usr_89fe82c93b264076aa1b6e1fb4813aaf
usr_2aa58c1c970a4c5eaa206a755c1c7bf4
msg_ec43510732ae47a6a5d5f323b7461d68
msg_3b03ceeb7e06490d998c3e368b435851
With a RangeIndex policy in place on the ID, I should be able to query the collection for specific types. For example:
SELECT * FROM c WHERE STARTSWITH(c.id, 'usr_') AND ...
Since this is a web application with many different document types, many of my app's queries would implement this STARTSWITH filter by default.
My main concern here is the use of a random GUID string on the ID. I know that in SQL Server I have had issues with index performance and fragmentation while using random GUIDs on the primary key in a clustered index.
Is there a similar concern here? It seems that in DocumentDB, the care of managing indexes has been abstracted away from you. Would a sequential ID be more ideal/performant in any way?
tl;dr: Use separate fields for the type and a GUID-only ID and use hash indexes on both.
This answer is necessarily going to be somewhat opinionated based upon the nature of your questions. Let me first address what appears to be your primary concern, namely the fragmentation of indexes affecting performance.
DocumentDB assumes the use of GUIDs, and a hash index (as opposed to a range index) is ideally suited to finding the one matching entity by GUID. On the other hand, if you want to find a set of documents by looking at the beginning of the string, I suspect that would be more performant with a range index. This assumes that STARTSWITH is only optimized when used with range indexes, though I don't know for a fact that it is optimized even when you have a range index.
My recommendation would be to use separate fields for the type and a GUID-only ID and use hash indexes on both. This gives you the advantage of being assured that queries like the one you show would be highly performant and that queries which combine a type clause with other parameters would also be able to use at least one index. Note, hash indexes of this type (say 2x 3 bytes = 6 bytes/document) are highly space efficient, so don't worry about needing two of them. Those two combined should be much smaller than one range index which needs to have enough precision to cover the entire length of your type+GUID.
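For concreteness, an indexing policy along those lines might look roughly like the following, expressed as the Python dict you would supply when creating the collection (classic DocumentDB policy format; the paths and precision values are illustrative, not prescriptive):

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [
        # Hash index on the separate document-type field
        {"path": "/type/?",
         "indexes": [{"kind": "Hash", "dataType": "String", "precision": 3}]},
        # Hash index on the GUID-only id-style field
        {"path": "/docId/?",
         "indexes": [{"kind": "Hash", "dataType": "String", "precision": 3}]},
        # Default handling for everything else
        {"path": "/*",
         "indexes": [{"kind": "Range", "dataType": "Number", "precision": -1},
                     {"kind": "Hash", "dataType": "String", "precision": 3}]},
    ],
    "excludedPaths": [],
}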
Other than the performance and space reasons already discussed, I can see a couple of other disadvantages to combining the type with the GUID: 1) when trying to retrieve a single document (both for direct use and as part of a foreign key lookup), having the GUID separate and using a hash index will be faster and more space efficient than using a range index on the combined field; 2) Combining the type with the ID greatly complicates certain migrations that commonly need to be done at a later date. Let's say that you decide to break your users into authors and readers for example. Users are foreign key referenced in other document types (blog post author, reader comment, etc.) by the user ID. If that ID includes the type, then you would need to not only change the user documents to accomplish the migration but you'd also need to find and change every foreign key. If the two fields (GUID and type) were separate, then you'd only need to change the user documents. Agile software craftsmanship is largely about making decisions that provide flexibility down the road.
As for the use of a sequential ID, the trend in databases in general, and NoSQL in particular, is that the complexity of providing a monotonically increasing sequential ID outweighs its space-efficiency advantages over a GUID. If you are going to stick with DocumentDB, I recommend that you just go with the flow and use GUIDs.

What is an index in SQLite?

I don't understand what an index is or does in SQLite. (NOT SQL) I think it allows for sorting in ascending and descending order and quicker access to data. But I'm just guessing here.
Why not SQL? The answer is the same, though the internal details will differ between implementations.
Putting an index on a column tells the database engine to build, unsurprisingly, an index that allows it to quickly locate rows when you search for certain values in a column, without having to scan every row in the table.
A simple (and probably suboptimal) index might be built with an ordinary binary search tree.
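As a quick illustration with Python's built-in sqlite3 module (table and column names made up), you can watch the query plan switch from a full scan to an index search once an index exists:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

query = "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?"
print(con.execute(query, ("a@b.example",)).fetchall())  # ...SCAN users (every row examined)

con.execute("CREATE INDEX idx_users_email ON users (email)")
print(con.execute(query, ("a@b.example",)).fetchall())  # ...SEARCH users USING INDEX idx_users_email (email=?)

(SQLite actually implements its indexes as B-trees rather than plain binary search trees, but the effect on lookups is the same idea.)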
Yes, indexes are all about improved data access performance (but at the cost of storage)
http://en.wikipedia.org/wiki/Index_(database)
An index (in any database) is a list of some kind which associates a sorted (or at least, quickly searchable) list of keys with information about where to find the rest of the data associated with the key.
You may not be finding information about this on the Internet because you're assuming it's a SQLite concept, but it's not - it's a general computer engineering concept.
Think about an address book. If you are searching for the phone number of Rossi Mario, you know that surnames are ordered alphabetically, so you can go to the letter R, then search for the letter o, and so on. Indexes do the same thing: they are collections of references to entries that speed up some operations a lot.
Searching in an unordered address book would be much slower; you would have to start from the first name on the first page and search through all the pages until you find the name you are looking for.
I think it allows for sorting in ascending and descending order and quicker access to data.
Yes, that's what it's for. Indexes create the abstraction of having sorted data, which speeds up searches significantly. With an index using a balanced binary search tree, searches take O(log N) instead of O(N) time.
What the other answers haven't mentioned is that most databases use indexes in order to implement UNIQUE (and therefore also PRIMARY KEY) constraints, because in order to ensure uniqueness, you have to be able to detect whether the key is already there, and that means you want fast searches for it.
Take a look in your SQLite database. Those sqlite_autoindex_ indices were created to enforce UNIQUE constraints.
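A quick way to see those automatic indexes with the standard-library sqlite3 module (the table name is made up):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (email TEXT UNIQUE, name TEXT)")

# The UNIQUE constraint caused SQLite to create sqlite_autoindex_users_1 behind the scenes.
print(con.execute("SELECT name FROM sqlite_master WHERE type = 'index'").fetchall())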
The same as an index in any SQL (YES SQL) RDBMS.
You can see the SQLite query optimizer considers indexes: http://www.sqlite.org/optoverview.html
Speed up searching and sorting
Different types of SQLite indices speed up searching and sorting in different ways.
The following tutorial explains this in a great way.
