What is an index in SQLite? - sqlite

I don't understand what an index is or does in SQLite. (NOT SQL) I think it allows for sorting in ascending and descending order and quicker access to data. But I'm just guessing here.

Why not SQL? The answer is the same, though the internal details will differ between implementations.
Putting an index on a column tells the database engine to build, unsurprisingly, an index that allows it to quickly locate rows when you search for certain values in a column, without having to scan every row in the table.
A simple (and probably suboptimal) index might be built with an ordinary binary search tree.
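A quick way to see this in action is to ask SQLite for its query plan before and after creating an index. This is a minimal sketch using Python's sqlite3 module; the table and data are made up for illustration.
import sqlite3

# In-memory database with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Rossi Mario", "555-0100"), ("Smith Anna", "555-0101")])

# Without an index, SQLite has to look at every row.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT phone FROM people WHERE name = 'Rossi Mario'"
).fetchall())   # reports a full table scan, something like "SCAN people"

# With an index on the column, the same query can seek directly to the row.
conn.execute("CREATE INDEX idx_people_name ON people(name)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT phone FROM people WHERE name = 'Rossi Mario'"
).fetchall())   # reports something like "SEARCH people USING INDEX idx_people_name (name=?)"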

Yes, indexes are all about improved data access performance (but at the cost of storage)
http://en.wikipedia.org/wiki/Index_(database)

An index (in any database) is a list of some kind which associates a sorted (or at least, quickly searchable) list of keys with information about where to find the rest of the data associated with the key.
You may not be finding information about this on the Internet because you're assuming it's a SQLite concept, but it's not - it's a general computer engineering concept.

Think about an address book. If you are searching for the phone number of Rossi Mario, you know that surnames are ordered alphabetically, so you can go to the letter R, then search for the letter o, and so on. Indexes do the same thing: they are collections of references to entries that speed up some operations a lot.
Searching in an unordered address book would be much slower: you would have to start from the first name on the first page and search through all the pages until you find the name you are looking for.

I think it allows for sorting in ascending and descending order and quicker access to data.
Yes, that's what it's for. Indexes create the abstraction of having sorted data, which speeds up searches significantly. With an index built on a balanced binary search tree, searches take O(log N) instead of O(N) time.
What the other answers haven't mentioned is that most databases use indexes to implement UNIQUE (and therefore also PRIMARY KEY) constraints: to ensure uniqueness, you have to be able to detect whether the key is already there, and that means you want fast searches for it.
Take a look in your SQLite database. Those sqlite_autoindex_ indices were created to enforce UNIQUE constraints.
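For instance (a small sketch with a made-up table), a UNIQUE column makes SQLite build one of those automatic indexes, which you can see by querying sqlite_master:
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint on email forces SQLite to build an index it can use
# to check for duplicates quickly. (The INTEGER PRIMARY KEY is just the rowid
# alias and does not need its own index.)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

# List the indexes SQLite created automatically.
for name, tbl in conn.execute(
        "SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"):
    print(name, tbl)   # prints something like: sqlite_autoindex_users_1 users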

The same as an index in any SQL (YES SQL) RDBMS.
You can see how the SQLite query optimizer considers indexes: http://www.sqlite.org/optoverview.html

Speed up searching and sorting
Different types of SQLite indices speed up searching and sorting in different ways.
The following tutorial explains this in a great way.
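For a concrete (if simplified) illustration of both effects, here is a sketch with a made-up table: a single multi-column index can satisfy the WHERE filter and also hand back rows already in ORDER BY order, so no separate sort step is needed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, created_at TEXT, total REAL)")
# A composite index ordered by (customer, created_at).
conn.execute("CREATE INDEX idx_orders_cust_date ON orders(customer, created_at)")

# The index both narrows the search to one customer and yields rows already
# sorted by created_at.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM orders WHERE customer = 'acme' ORDER BY created_at"
).fetchall()
print(plan)   # should show a SEARCH using idx_orders_cust_date and no temp B-tree for ORDER BY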

Related

How to model data in dynamodb if your access pattern includes many WHERE conditions

I am a bit confused if this is possible in DynamoDB.
I will give an example of SQL and explain how the query could be optimized and then I will try to explain why I am confused on how to model this and how to access the same data in DynamoDB.
This is not company code. Just an example I made up based on pcpartpicker filter.
SELECT * FROM BUILDS
WHERE CPU='Intel' AND OVERCLOCKED='true'
AND Price < 3000
AND GPU='GeForce RTX 3060'
AND ...
From my understanding, SQL will first scan the BUILDS table and keep only the builds whose CPU is Intel. From this subset, it then applies the next WHERE condition to filter on OVERCLOCKED = 'true', and so on. Basically, each additional WHERE clause has a smaller number of rows left to filter.
One thing we can do to speed up this query is to create an index on these columns. The main performance gain comes from avoiding the initial scan over the whole table for the first clause the database looks at. So, in the example above, instead of scanning the whole database to find builds that use Intel, it can retrieve them quickly because that column is indexed.
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of letting the engine evaluate one WHERE clause and pass the result along to the next filter, it seems like you would have to do all of this yourself. For example, we would need to use our secondary indexes to find all the builds that use Intel, are overclocked, cost less than 3000, and use a specific GPU, and then compute the intersection ourselves. Is there a better way to map out this access pattern? I am having a hard time figuring out if this is even possible.
EDIT:
I know I could also just use a normal filter, but it seems like this would be pretty expensive, since it basically brute-force searches through the table, similar to the SQL solution without indexing.
To see what I mean from pcpartpicker here is the link to the site with this page: https://pcpartpicker.com/builds/
People basically select multiple filters, which makes designing for access patterns even harder.
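For reference, the "normal filter" approach mentioned in the EDIT would look roughly like this with boto3 (the table and attribute names just mirror the made-up example above). Note that DynamoDB still reads every item and only applies the filter before returning results, so you pay for the full scan.
import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical table mirroring the BUILDS example above.
dynamodb = boto3.resource("dynamodb")
builds = dynamodb.Table("Builds")

# Scan with a FilterExpression: every item is read and billed, the filter
# only trims what comes back over the wire.
response = builds.scan(
    FilterExpression=(
        Attr("CPU").eq("Intel")
        & Attr("Overclocked").eq(True)
        & Attr("Price").lt(3000)
        & Attr("GPU").eq("GeForce RTX 3060")
    )
)
print(response["Items"])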
I'd highly recommend going through the various AWS presentations on YouTube...
In particular here's a link to The Iron Triangle of Purpose - PIE Theorem chapter of the AWS re:Invent 2018: Building with AWS Databases: Match Your Workload to the Right Database (DAT301) presentation.
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.

DocumentDB Index Performance / Fragmentation

I have decided to implement the following ID strategy for my documents, which combines the document "type" with the ID:
doc.id = "docType_" + Guid.NewGuid().ToString("n");
// create document in collection
This results in IDs such as the following for my documents:
usr_19d17037ea7f41a9b20db1a90f71d30d
usr_89fe82c93b264076aa1b6e1fb4813aaf
usr_2aa58c1c970a4c5eaa206a755c1c7bf4
msg_ec43510732ae47a6a5d5f323b7461d68
msg_3b03ceeb7e06490d998c3e368b435851
With a RangeIndex policy in place on the ID, I should be able to query the collection for specific types. For example:
SELECT * FROM c WHERE STARTSWITH(c.id, 'usr_') AND ...
Since this is a web application with many different document types, many of my app's queries would implement this STARTSWITH filter by default.
My main concern here is the use of a random GUID string on the ID. I know that in SQL Server I have had issues with index performance and fragmentation while using random GUIDs on the primary key in a clustered index.
Is there a similar concern here? It seems that in DocumentDB, the care of managing indexes has been abstracted away from you. Would a sequential ID be more ideal/performant in any way?
tl;dr: Use separate fields for the type and a GUID-only ID and use hash indexes on both.
This answer is necessarily going to be somewhat opinionated, based on the nature of your questions. Let me first address what appears to be your primary concern, namely fragmentation of indexes affecting performance.
DocumentDB assumes the use of GUIDs, and a hash index (as opposed to a range index) is ideally suited to finding the one matching entity by GUID. On the other hand, if you want to find a set of documents by looking at the beginning of the string, I suspect a range index would probably be more performant. That assumes STARTSWITH is only optimized when used with range indexes, and I don't know for a fact that it is optimized even when you have a range index.
My recommendation would be to use separate fields for the type and a GUID-only ID, and to use hash indexes on both. This gives you the advantage of being assured that queries like the one you show would be highly performant, and that queries which combine a type clause with other parameters would also be able to use at least one index. Note that hash indexes of this type (say 2x 3 bytes = 6 bytes/document) are highly space efficient, so don't worry about needing two of them. Those two combined should be much smaller than one range index, which needs enough precision to cover the entire length of your type+GUID.
Other than the performance and space reasons already discussed, I can see a couple of other disadvantages to combining the type with the GUID: 1) when trying to retrieve a single document (both for direct use and as part of a foreign key lookup), having the GUID separate and using a hash index will be faster and more space efficient than using a range index on the combined field; 2) Combining the type with the ID greatly complicates certain migrations that commonly need to be done at a later date. Let's say that you decide to break your users into authors and readers for example. Users are foreign key referenced in other document types (blog post author, reader comment, etc.) by the user ID. If that ID includes the type, then you would need to not only change the user documents to accomplish the migration but you'd also need to find and change every foreign key. If the two fields (GUID and type) were separate, then you'd only need to change the user documents. Agile software craftsmanship is largely about making decisions that provide flexibility down the road.
As for the use of a sequential ID, the trend in databases in general, and NoSQL in particular, is that the complexity of providing a monotonically increasing sequential ID outweighs the space-efficiency advantage it has over a GUID. If you are going to stick with DocumentDB, I recommend that you just go with the flow and use GUIDs.
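A minimal sketch of the suggested shape (field names are illustrative): keep the GUID and the type in separate fields, and the STARTSWITH query becomes a plain equality filter that a hash index can serve.
import uuid

# Hypothetical document shape: type and GUID live in separate fields so each
# can get its own cheap hash index.
user_doc = {
    "id": uuid.uuid4().hex,   # GUID only, e.g. "19d17037ea7f41a9b20db1a90f71d30d"
    "type": "usr",            # the document type as its own field
    "name": "Mario Rossi",
}

# The STARTSWITH(c.id, 'usr_') query then becomes a simple equality filter.
query = "SELECT * FROM c WHERE c.type = 'usr'"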

Is a scan query always expensive in DynamoDB or should you use a range key

I've been playing around with Amazon DynamoDB and looking through their examples but I think I'm still slightly confused by the example. I've created the example data on a local dynamodb instance to get used to querying data etc. The sample data sets up 3 tables of 'Forum'->'Thread'->'Reply'
Now if I'm in a specific forum, the thread table has a ForumName key I can query against to return relevant threads, but would the very top level (displaying the forums) always have to be a scan operation?
From what I can gather, the only way to "select *" in DynamoDB is to use a Scan, and I assume in this instance - where Forum is very high level and might have a relatively small number of rows - that it wouldn't be that expensive. Or are you actually better off creating a hash and range key and using that to query this table? I'm not sure what the range key would be in this instance; maybe just a number, with the query specifying that the value has to be > 0? Or perhaps the date it was created, with the query always using a constant date in the past?
I did try a sample query on the 'Forum' table example data using a ComparisonOperator of 'GE' (greater than or equal) with an attribute value list of 'S'=>'a', but this states that any conditions on the hash key must be of type EQ, which implies I couldn't do the above, as I would always need to know my 'Name' values upfront.
Maybe I'm still struggling having come from an RDBMS background, especially as there are many forum examples out there.
thanks
I think using Scan to get all the forums is fine. It is very efficient because it will not return anything you don't need (all of the work that the Scan does is necessary). Also, since the Scan operation is so simple, it is easier to implement and more likely to be efficient.
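A hedged sketch of that top-level listing with boto3, assuming the default DynamoDB Local endpoint on port 8000 and the "Forum" table from the AWS sample data (whose hash key is Name):
import boto3

dynamodb = boto3.resource("dynamodb", endpoint_url="http://localhost:8000")
forums = dynamodb.Table("Forum")

# A small table like this is cheap to Scan; handle paging just in case.
items, response = [], forums.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = forums.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

print([item["Name"] for item in items])   # each forum's Name hash key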

Aggregating and deduplicating information extracted from multiple web sites

I am working on building a database of timing and address information for restaurants, extracted from multiple web sites. Since information for the same restaurant may be present on multiple web sites, the database will end up with some nearly duplicate copies.
The number of restaurants is large, say 100,000. For each new entry I would have to do on the order of 100,000 comparisons to check whether restaurant information with a nearly similar name is already present, which works out to roughly 100,000^2 comparisons over the whole dataset. So I am asking whether there is a more efficient approach. Thank you.
Basically, you're looking for a record linkage tool. These tools can index records, then for each record quickly locate a small set of potential candidates, then do more detailed comparison on those. That avoids the O(n^2) problem. They also have support for cleaning your data before comparison, and more sophisticated comparators like Levenshtein and q-grams.
The record linkage page on Wikipedia used to have a list of tools on it, but it was deleted. It's still there in the version history if you want to go look for it.
I wrote my own tool for this, called Duke, which uses Lucene for the indexing, and has the detailed comparators built in. I've successfully used it to deduplicate 220,000 hotels. I can run that deduplication in a few minutes using four threads on my laptop.
One approach is to structure your similarity function such that you can look up a small set of existing restaurants to compare your new restaurant against. This lookup would use an index in your database and should be quick.
How to define the similarity function is the tricky part :) Usually you can translate each record to a series of tokens, each of which is looked up in the database to find the potentially similar records.
Please see this blog post, which I wrote to describe a system I built to find near duplicates in crawled data. It sounds very similar to what you want to do and since your use case is smaller, I think your implementation should be simpler.
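A very rough sketch of that token/blocking idea in plain Python (names and the threshold are made up; a real record linkage tool adds data cleaning and better comparators):
from collections import defaultdict
from difflib import SequenceMatcher

# Index existing restaurants by name tokens so a new record is only compared
# against the handful of candidates that share at least one token with it.
token_index = defaultdict(set)   # token -> set of restaurant ids
restaurants = {}                 # id -> name

def add_restaurant(rid, name):
    restaurants[rid] = name
    for token in name.lower().split():
        token_index[token].add(rid)

def find_near_duplicates(name, threshold=0.8):
    candidates = set()
    for token in name.lower().split():
        candidates |= token_index[token]
    # Only the small candidate set gets the expensive string comparison.
    return [rid for rid in candidates
            if SequenceMatcher(None, name.lower(),
                               restaurants[rid].lower()).ratio() >= threshold]

add_restaurant(1, "Trattoria Rossi")
add_restaurant(2, "Blue Lotus Cafe")
print(find_near_duplicates("Trattoria Rossi Milano"))   # [1], the near-duplicate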

Using Lucene.Net as a primary lookup for lists before heading to the database, is this a good idea?

First of all, we are building an ASP.NET web application. I do not want to use Lucene as a database, per se, but rather as the primary look-up for displaying lists to the user. This would be a canned search to Lucene where we would pull, say, all user information to be displayed in a grid list. Is it a good idea to pull, from Lucene initially, a list of items (that can be paged) to display to the user in some sort of grid format? The only time we would call the database is when a user selects a specific record to view or update.
My concern is stale data coming from Lucene. I have been looking for information about adds and updates to an index, but it is unclear to me whether my scenario is better suited to a database than to Lucene. The other developers and I have been going back and forth about this, but unfortunately we don't know enough about how Lucene handles writes and reads.
I'm not sure if it's a good or bad fit for your use case. Hopefully I can give you some insight on how Lucene stores its data, and you can make a decision from that.
Lucene is extremely quick if you want to search for an item in its index. The time it takes to index items isn't so quick. It's by no means slow if you look at everything it's doing, but it adds complexity in knowing what you need to do about it.
Lucene is essentially a document store. Each item in Lucene is a Document, which can hold a number of fields. Those fields are essentially key-value pairs, though right now Lucene only supports string and byte[] as values, and strings only as keys. Each field can be indexed and/or analyzed (or neither). Indexing simply means you can search against that field's data, generally only via exact matches and wildcards. Analyzing gives you better searching capabilities, since it takes the string and tokenizes it. Depending on the analyzer, it will tokenize differently; the most common approach splits on whitespace and removes stopwords, essentially marking each word as a term unless it's something like (a, an, the, as, etc...).
The real killer when Lucene is used this way is that you can't update a document in an index in place. When you pull out a document to update it and change a field, the call to UpdateDocument() actually marks the old document as deleted and inserts a new document.
Notice I said it marks it as deleted. That introduces another aspect of Lucene indexes: optimization. When you write to an index, every so often a segment of the index is written to disk (it's temporarily stored in RAM for fast indexing). When you run a search on an index, Lucene needs to open all those different segments to find the terms to search against (and it has to order them in a certain way, too). This means that if you have many segments, searching can be slow. A call to Optimize() will not only merge the segments together, it will also remove any documents marked for deletion, lowering your index size as well.
However, optimizing your index requires around 1.5x more space while the optimization is being done, sometimes more. Fortunately, Lucene.Net is transactional during an optimization, which means that not only will your index not be corrupted if an optimization fails, but any existing IndexReader you have open will still be able to search and read from the index while you're optimizing it.
In short, if it were me, and you were expecting only to get one result from a search each time, I might not recommend Lucene. Lucene especially shines when you're searching through many documents for many matching documents. It's an inverted index and it's good at that. For a single lookup, you may be better off with a database. Unfortunately, the only way you'll really find out is to benchmark it. Fortunately, at least Lucene.Net is very easy to set up for something like that.
Also, if you do use Lucene.Net, consider our 2.9.4g branch. You may not be able to use it, since it is technically not release code, but it is a bit faster than normal Lucene, as we've added generics and removed a bit of the costly boxing done in previous versions.
Lucene is not a good fit for the scenario you're describing. You're looking at caching data.
Why not use the Asp.net cache? If you need a more robust caching solution, there's memcached and a whole host of other ones ... even NoSql stores like mongo, redis, etc.
Obviously, you'll need to manually remove items from the cache on updates to stop serving stale data.
I think this is a viable solution, and I say this because there is a major open source content management system that is using a technique very similar to what you've described. It's called Umbraco, and its version 5 is going to be using a customized version of Lucene.NET as a sort of cache.
You can look at the project and source here: http://umbraco.codeplex.com/SourceControl/changeset/view/5a7c9af9bbf9
