When constructing an index tree from existing data, there is a bulk-loading algorithm, e.g.:
https://en.wikipedia.org/wiki/B%2B_tree#Bulk-loading
https://www.youtube.com/watch?v=HJgXVxsO5YU
When creating an index for a non-empty table, does SQLite use bulk loading, or does it build the index by repeated insertion? From my performance test it seems that SQLite uses insertion, because the time costs of inserting into the table after indexing and of creating the index after insertion are similar.
Do we know why bulk-loading is not used? Does it not work well in practice?
Bulk loading requires that the data is already sorted.
SQLite implements sorting by inserting the rows into a temporary index, so using it for bulk loading would not be productive.
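The observation from the question is easy to reproduce; here is a small sketch using Python's sqlite3 module (the table and index names are invented). It only shows that both orders of operations end in the same indexed lookup, not what SQLite does internally:

```python
import random
import sqlite3

def build(index_first: bool) -> str:
    """Create a table of 10,000 rows, indexing before or after the
    inserts, and return the query plan for an indexed lookup."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t(a INTEGER, b TEXT)")
    if index_first:
        con.execute("CREATE INDEX t_a ON t(a)")
    con.executemany("INSERT INTO t VALUES(?, ?)",
                    [(random.randrange(10_000), "x") for _ in range(10_000)])
    if not index_first:
        con.execute("CREATE INDEX t_a ON t(a)")
    # Column 3 of EXPLAIN QUERY PLAN output is the human-readable detail.
    plan = con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 42").fetchone()[3]
    con.close()
    return plan

# Both orders end in the same place: a lookup that uses index t_a.
print(build(index_first=True))
print(build(index_first=False))
```

Timing the two variants (e.g. with time.perf_counter around the calls) is what produces the similar figures mentioned in the question.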
Related
We are using DynamoDB and have some complex queries that would be much easier to handle in code than with a complicated DynamoDB scan operation. Is it better to write a scan operation, or to pull the minimal amount of data using a query operation (on the hash key or a secondary index) and perform further filtering and reduction in the calling code itself? Is this considered bad practice, or is it okay to do in NoSQL?
Unfortunately, it depends.
If you have an even modestly large table, a table scan is not practical.
If you have complicated query needs, the best way to tackle them in DynamoDB is with Global Secondary Indexes (GSIs) acting as projections on the fields that you want. You can use techniques such as sparse indexes (creating a GSI on fields that only exist on a subset of the objects) and composite attribute keys (concatenating two or more attributes and using the result as a new attribute to create a GSI on).
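As a sketch of the composite-attribute technique (plain Python, no DynamoDB calls; the '#' separator and the attribute values are made-up examples):

```python
def composite_key(*parts, sep="#"):
    """Concatenate attribute values into one composite key string.

    The separator must be a character that cannot occur in the data
    (the '#' here is an assumption); that keeps the parts unambiguous,
    so a begins_with() condition on the GSI sort key can select by any
    prefix of the composite value.
    """
    return sep.join(str(p) for p in parts)

# A GSI sort key combining country, city and date lets one query
# fetch e.g. every German item via begins_with(gsi_sk, "DE#").
print(composite_key("DE", "Berlin", "2023-01-15"))  # DE#Berlin#2023-01-15
```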
However, to directly address the question "Is it okay to filter using code instead of the NoSQL database?", the answer is yes, that is an acceptable approach. Filtering inside DynamoDB does not reduce the read "cost" of the query (the consumed capacity is the same either way); it only reduces unnecessary data transfer over the network.
The ideal solution is to use a GSI to reduce the scope of what is returned to as close to what you want as possible. If necessary, some additional filtering to eliminate a few records is fine, either through a filter expression in DynamoDB or in your own code.
I recently began exploring indexing in sqlite. I'm able to successfully create an index with my desired columns.
After doing this I took a look at the database to see that the index was created successfully only to find that sqlite had already auto created indexes for each of my tables:
"sqlite_autoindex_tablename_1"
These auto generated indices used two columns of each table, the two columns that make up my composite primary key. Is this just a normal thing for sqlite to do when using composite primary keys?
Since I'll be doing most of my queries based on these two columns, does it make sense to manually create indices, which are the exact same thing?
New to indices so really appreciate any support/feedback/tips, etc -- thank you!
SQLite requires an index to enforce the PRIMARY KEY constraint -- without an index, enforcing the constraint would slow dramatically as the table grows in size. Constraints and indexes are not identical, but I don't know of any relational database that does not automatically create an index to enforce primary keys. So yes, this is normal behavior for any relational database.
If the purpose of creating an index is to optimize searches where you have an indexable search term that involves the first column in the index then there's no reason to create an additional index on the column(s) -- SQLite will use the automatically created one.
If your searches will involve the second column in the index without an indexable term for the first column, you will need to create your own index. Neither SQLite (nor any other relational database I know of) can use a composite index to optimize filtering when the leading columns of the index are not constrained in the search.
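The behavior above can be checked with EXPLAIN QUERY PLAN; here is a minimal sketch using Python's sqlite3 module (the table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(customer TEXT, item TEXT, qty INTEGER, "
            "PRIMARY KEY (customer, item))")

def plan(sql):
    # Column 3 of EXPLAIN QUERY PLAN output is the readable detail.
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Leading column: the automatic primary-key index is used.
print(plan("SELECT * FROM orders WHERE customer = 'alice'"))
# Second column alone: full table scan...
print(plan("SELECT * FROM orders WHERE item = 'widget'"))
# ...until we create our own index on it.
con.execute("CREATE INDEX orders_item ON orders(item)")
print(plan("SELECT * FROM orders WHERE item = 'widget'"))
```

The first plan mentions sqlite_autoindex_orders_1 (the index behind the composite primary key), the second is a plain scan, and the third uses the manually created index.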
I am using sqlite3 (maybe sqlite4 in the future) and I need something like dynamic tables.
I have many tables with the same format: values_2012_12_27, values_2012_12_28, ... (number of tables is dynamic) and I want to select dynamically the table that receives some data.
I am using sqlite3_prepare with INSERT INTO ? VALUES(?,?,?). Of course this fails to compile (syntax error near "?"). Is there a nice and simple way to do this in SQLite?
Thanks
Using SQL parameters is not possible for identifiers such as table or column names.
If you don't want to keep so many prepared statements around, just prepare them on the fly whenever you need one.
If your database were properly normalized, you would have a single big values table with an extra date column.
This organization is usually to be preferred, unless you have measured both and found that the better performance (if it actually exists) outweighs the overhead of managing multiple tables.
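A minimal sketch of the prepare-on-the-fly approach, in Python's sqlite3 module: since identifiers cannot be bound as parameters, the table name is validated against a strict pattern before being spliced into the SQL text (the pattern here is an assumption based on the naming scheme in the question):

```python
import re
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE values_2012_12_27(a, b, c)")

def insert_into(table, row):
    # Identifiers cannot be bound as parameters, so the table name has
    # to be spliced into the SQL text; validating it against a strict
    # pattern first prevents SQL injection.
    if not re.fullmatch(r"values_\d{4}_\d{2}_\d{2}", table):
        raise ValueError("unexpected table name: " + table)
    con.execute(f'INSERT INTO "{table}" VALUES(?, ?, ?)', row)

insert_into("values_2012_12_27", (1, 2, 3))
```

In C the equivalent is to snprintf the validated table name into the SQL string and then call sqlite3_prepare_v2 with the usual ? placeholders for the values.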
Even after reading a lot about SQLite's full-text index, a question arises that I haven't seen answered anywhere:
I already have a table that I want to search with the fulltext index. I would just create an extra virtual table USING FTS3 or USING FTS4 and then INSERT my data into it.
Does that then use double the storage in total? Can I use such a virtual table just like a normal table, and thus avoid storing the data twice?
(I am working with SQLite on Android but this question may apply to usage on any SQLite compatible platform.)
Despite the fact that you already found some details, I'll try to provide a detailed answer:
1. Does that then use double the storage in total?
Yes, it does. Moreover, it might use even more space. For example, for the well-known Enron E-Mail Dataset used in the FTS3 documentation, just feel the difference:

The FTS3 table consumes around 2006 MB on disk, compared to just 1453 MB for the ordinary table.
The FTS3 table took just under 31 minutes to populate, versus 25 for the ordinary table.
This makes the situation a bit unpleasant, but full-text search is still worth it.
2. Can I use such a virtual table just like a normal table?
The short answer is no, you can't. A virtual table is a kind of view with several limitations; you've already noticed some of them.
Generally speaking, you should not use any feature that seems unnatural for a view. Stick to the bare minimum required to let your application fully utilize the power of full-text search, so there will be no surprises later with newer versions of the module.
There is no magic behind this solution, it is just a trade-off between performance, required disk space and functionality.
Final conclusion
I would highly recommend using FTS4, because it is faster and its only drawback is the additional storage space needed.
Anyway, you have to design the virtual table carefully, taking into account the supplementary and highly specialized nature of such a solution. In other words, do not try to replace your initial table with the virtual one; use both, with great care.
Update
I would recommend looking through the following article: iOS full-text search with Core Data and SQLite. A few interesting points:
The virtual table is created in the same SQLite database in which the Core Data content resides. To keep this table as light as possible, only object properties relevant to the search query are inserted.
The SQLite implementation offers something Core Data does not: full-text search. Next to that, it performs almost 10% faster and at least 660% more (memory) efficiently than a comparable Core Data query.
I just looked up the main limitations of virtual tables; whether a single table suffices seems to depend on your usage:
One cannot create a trigger on a virtual table.
One cannot create additional indices on a virtual table. (Virtual tables can have indices, but these must be built into the virtual table implementation; indices cannot be added separately using CREATE INDEX statements.)
One cannot run ALTER TABLE ... ADD COLUMN commands against a virtual table.
So if you need another index on the table, you need to use two tables.
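One way to avoid storing the text twice, assuming your SQLite build includes FTS4 (most builds do), is an "external content" table created with the content= option: only the full-text index is stored, while the text itself stays in the base table. A sketch in Python's sqlite3 module (table and column names invented); note that external-content tables are not updated automatically, so they must be kept in sync by hand or via triggers on the base table (triggers on the virtual table itself are not allowed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs(id INTEGER PRIMARY KEY, body TEXT)")
# content="docs" makes this an *external content* FTS4 table: only the
# full-text index is stored; the text itself stays in docs.
con.execute('CREATE VIRTUAL TABLE docs_fts USING fts4(content="docs", body)')

con.execute("INSERT INTO docs(body) VALUES ('the quick brown fox')")
# Sync the index by hand; in a real application, triggers on the
# base table would do this on every INSERT/UPDATE/DELETE.
con.execute("INSERT INTO docs_fts(docid, body) SELECT id, body FROM docs")

hits = con.execute(
    "SELECT docid FROM docs_fts WHERE body MATCH 'quick'").fetchall()
print(hits)
```

This trades the duplicate storage for the bookkeeping burden of keeping the two tables consistent.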
I'm trying to create an application which uses trigrams for approximate string matching. All the records are in the database, and I want to be able to search on a fixed column. Is it best to have an additional field which contains a precomputed trigram version of the value I want to search (if so, what's the best way to store it?), or is it better to generate the trigrams on the fly?
Which database are you using?
PostgreSQL has trigram functions built in (the pg_trgm extension), which work with GiST or GIN indexes.
In SQL Server, I'm using CLR to create and compare trigram sets; it works much, much faster than T-SQL code.
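If you end up generating trigrams in application code, the core is small; a sketch in Python (the two-space padding mimics what pg_trgm does, and the Jaccard similarity used here is one common choice, not the only one):

```python
def trigrams(s):
    # Pad with two leading spaces and one trailing space so that word
    # boundaries contribute trigrams too (pg_trgm pads similarly).
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    # Jaccard similarity of the two trigram sets.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta or tb else 1.0

print(trigrams("cat"))            # {'  c', ' ca', 'cat', 'at '}
print(similarity("hello", "hallo"))
```

On the storage question: precomputing trigrams into an indexed auxiliary table trades disk space for lookup speed, while generating them on the fly forces a scan of every candidate row per search, so for anything beyond a few thousand rows the precomputed form usually wins.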