SQLite fulltext virtual table normally usable?

Even after reading a lot about SQLite's full-text index, there is one question I didn't see answered anywhere:
I already have a table that I want to search with the full-text index. I would just create an extra virtual table USING FTS3 or USING FTS4 and then INSERT my data into it.
Does that then use double the storage in total? Can I use such a virtual table just like a normal table and thus avoid storing the data twice?
(I am working with SQLite on Android but this question may apply to usage on any SQLite compatible platform.)
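For reference, the setup described above would look roughly like this (a minimal sketch; the table and column names are made up):
CREATE TABLE articles(id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE VIRTUAL TABLE articles_fts USING fts4(title, body);
-- copy the existing rows into the full-text index
INSERT INTO articles_fts(docid, title, body) SELECT id, title, body FROM articles;
-- full-text query against the virtual table
SELECT docid FROM articles_fts WHERE articles_fts MATCH 'sqlite';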

Even though you have already found some details yourself, I'll try to provide a detailed answer:
1. Does that then use double the storage in total?
Yes, it does. Moreover, it might use even more space. For example, for the widely known Enron E-Mail Dataset used in the FTS3 documentation example, just feel the difference:
The FTS3 table consumes around 2006 MB on disk compared to just 1453 MB for the ordinary table.
The FTS3 table took just under 31 minutes to populate, versus 25 for the ordinary table.
Which makes the situation a bit unpleasant, but full-text search is still worth it.
2. Can I use such a virtual table just like a normal table?
The short answer is no, you can't. A virtual table is a kind of view with several limitations, and you have noticed several of them already.
Generally speaking, you should not use any feature that seems unnatural for a view: stick to the bare minimum required to let your application fully utilize the power of full-text search, so that there will be no surprises later with newer versions of the module.
There is no magic behind this solution; it is just a trade-off between performance, required disk space and functionality.
Final conclusion
I would highly recommend using FTS4, because it is faster and its only drawback is the additional storage space needed.
In any case, you have to design the virtual table carefully, taking into account the supplementary and highly specialized nature of such a solution. In other words, do not try to replace your initial table with the virtual one; use both, with great care.
Update
I would recommend looking through the following article: iOS full-text search with Core Data and SQLite. A few interesting points:
The virtual table is created in the same SQLite database in which the Core Data content resides. To keep this table as light as possible, only object properties relevant to the search query are inserted.
SQLite implementation offers something Core Data does not: full-text search. Next to that, it performs almost 10% faster and at least 660% more (memory) efficiently than a comparable Core Data query.

I just found out the main differences between virtual tables and ordinary tables, and whether a single table suffices for you seems to depend on your usage:
One cannot create a trigger on a virtual table.
One cannot create additional indices on a virtual table. (Virtual tables can have indices but that must be built into the virtual table implementation. Indices cannot be added separately using CREATE INDEX statements.)
One cannot run ALTER TABLE ... ADD COLUMN commands against a virtual table.
So if you need another index on the table, you need to use two tables.
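As a sketch of that two-table approach (hypothetical names; adapt the columns to your data), you can keep the ordinary table as the source of truth and mirror the searchable columns into the FTS table with triggers on the ordinary table, since triggers on the virtual table itself are not allowed:
CREATE TABLE notes(id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE VIRTUAL TABLE notes_fts USING fts4(title, body);
-- keep the FTS table in sync by triggering on the ordinary table
CREATE TRIGGER notes_ai AFTER INSERT ON notes BEGIN
  INSERT INTO notes_fts(docid, title, body) VALUES (new.id, new.title, new.body);
END;
CREATE TRIGGER notes_ad AFTER DELETE ON notes BEGIN
  DELETE FROM notes_fts WHERE docid = old.id;
END;
CREATE TRIGGER notes_au AFTER UPDATE ON notes BEGIN
  DELETE FROM notes_fts WHERE docid = old.id;
  INSERT INTO notes_fts(docid, title, body) VALUES (new.id, new.title, new.body);
END;
-- search the FTS table, then join back to the ordinary table for everything else
SELECT notes.* FROM notes JOIN notes_fts ON notes_fts.docid = notes.id
WHERE notes_fts MATCH 'sqlite';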

Related

How to model data in dynamodb if your access pattern includes many WHERE conditions

I am a bit confused about whether this is possible in DynamoDB.
I will give an example in SQL and explain how the query could be optimized, and then I will try to explain why I am confused about how to model this and how to access the same data in DynamoDB.
This is not company code. Just an example I made up based on pcpartpicker filter.
SELECT * FROM BUILDS
WHERE CPU='Intel' AND OVERCLOCKED='true'
AND Price < 3000
AND GPU='GeForce RTX 3060'
AND ...
From my understanding, SQL will first scan the BUILDS table and keep only the builds where the CPU is Intel. From that subset, it then applies the next WHERE condition to filter OVERCLOCKED = 'true', and so on. Basically, each additional WHERE clause has a smaller number of rows left to filter.
One thing we can do to speed up this query is to create an index on these columns. The main performance gain comes from avoiding the initial scan of the whole table for the first clause the database evaluates. So in the example above, instead of scanning the whole database to find builds that use Intel, it can retrieve them quickly because the column is indexed.
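For illustration only, the kind of index described above would look something like this in SQL (column names taken from the made-up query):
CREATE INDEX idx_builds_cpu ON BUILDS (CPU);
-- or a composite index covering several of the filtered columns
CREATE INDEX idx_builds_cpu_gpu_price ON BUILDS (CPU, GPU, Price);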
How would you model this data in DynamoDB? I know you can create a bunch of secondary indexes, but instead of letting the engine apply the WHERE clauses and pass the result along for the next round of filtering, it seems like you would have to do all of this yourself. For example, we would need to use our secondary indexes to find all the builds that use Intel, are overclocked, cost less than 3000, and use a specific GPU, and then we would need to find the intersection ourselves. Is there a better way to map out this access pattern? I am having a hard time figuring out if this is even possible.
EDIT:
I know I could also just use a normal filter, but that seems like it would be pretty expensive since it basically brute-force searches through the table, similar to the SQL solution without indexing.
To see what I mean from pcpartpicker here is the link to the site with this page: https://pcpartpicker.com/builds/
People basically select multiple filters so it makes designing for access patterns even harder.
I'd highly recommend going through the various AWS presentations on YouTube...
In particular here's a link to The Iron Triangle of Purpose - PIE Theorem chapter of the AWS re:Invent 2018: Building with AWS Databases: Match Your Workload to the Right Database (DAT301) presentation.
DynamoDB provides IE - Infinite Scale and Efficiency.
But you need P - Pattern Flexibility.
You'll need to decide if you need PI or PE.

SQL. Re: inner joins & foreign keys

Good day.
I have a basic question on SQL and table structure.
What we have now: 17 tables. These tables include 1 admin table. The other 13 tables are all branched off 3 "main" tables: customers, CareWorkers, Staff.
If I'm wanting to adhere to the ACID ideology, I then want to create tables that each house unique information.
My question, and what I'm trying to wrap my head around, is this: when I create each of these "nested-deeper" (not sure what to call them) tables, I simply do an inner join statement to grab the foreign key in my ASP.NET app, correct?
First, an inner join is how you get your tables "back together", and @SpectralGhost's example is how you do it. But you might want to consider doing it in the database rather than in your ASP code. The way you do that is with views. If you create a view (the syntax is CREATE VIEW and there are plenty of examples out there), then you can make the database schema as complex as you need to without making it hard to use in your ASP application. You can even make views updatable (you define an "INSTEAD OF" trigger; again, many examples if you search).
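As a minimal sketch (the table and column names are made up), a view over the joined tables might look like this:
CREATE VIEW CustomerAddresses AS
SELECT c.CustomerID, c.Name, a.Street, a.City
FROM Customers c
INNER JOIN Addresses a ON a.CustomerID = c.CustomerID;
Your ASP.NET code can then simply SELECT from CustomerAddresses without knowing how the underlying tables are joined.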
But you probably don't want to update a view, or a table, directly from your ASP code. You probably want to define STORED PROCEDUREs that update your data, and call those from your ASP code. This allows you to restrict access to your tables and views to read only and force any writes to come through a stored procedure you can control better. This helps prevent SQL INJECTION, making your ASP application much more secure. If the service account of the application pool your ASP page runs under can pass raw queries to the database, then any compromise can do tremendous damage to your database. If all it can do is execute a stored procedure where the parameters can be changed but not the functionality, attackers can only put some junk values in, or maybe not even that if you range check well.
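A minimal sketch of such a stored procedure, assuming SQL Server and made-up table and parameter names:
CREATE PROCEDURE UpdateCustomerName
    @CustomerID INT,
    @Name NVARCHAR(100)
AS
BEGIN
    -- the only write path the application account is allowed to use
    UPDATE Customers SET Name = @Name WHERE CustomerID = @CustomerID;
END;
Calling this from ASP.NET with a parameterized command keeps raw SQL out of the application and limits what a compromised account can do.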
The last bit of advice is that you are not preserving "ACID", you are preserving "NORMALIZED". It's definitely a tough concept to wrap your head around, here's a resource that helped me out a great deal when I was starting out. http://www.marcrettig.com/data-normalization-poster/ I still have a copy on my wall. You shouldn't obsess over normalization, but you should definitely keep it in mind and stick to it when you reasonably can. Again, there are numerous resources a search will get you, but the basic benefit is a normalized database is much more resistant to consistency problems, and is more storage efficient. And since disk IO is slow, storage efficient is usually query efficient too.
They are related tables. You should have at least one table with a primary key and often several others that relate back to that table through a foreign key.
TableOne
    TableOneID
TableTwo
    TableTwoID
    TableOneID
TableTwo relates to TableOne via TableOneID. An inner join returns the rows where there are matching records in both tables, based on your join condition. Example:
SELECT *
FROM TableOne t1
INNER JOIN TableTwo t2 ON t1.TableOneID=t2.TableOneID
Specifically how to do this in your application depends on your design. If you are using an ORM, then actual SQL is not terribly important. If you are using stored procedures, then it is.

Caching result of SELECT statement for reuse in multiple queries

I have a reasonably complex query to extract the Id field of the results I am interested in based on parameters entered by the user.
After extracting the relevant Ids I am using the resulting set of Ids several times, in separate queries, to extract the actual output record sets I want (by joining to other tables, using aggregate functions, etc).
I would like to avoid running the initial query separately for every set of results I want to return. I imagine my situation is a common pattern so I am interested in what the best approach is.
The database is in MS SQL Server and I am using .NET 3.5.
It would definitely help if the question contained some measurements of the unoptimized solution (data sizes, timings). There is a variety of techniques that could be considered here, some listed in the other answers. I will assume that the reason why you do not want to run the same query repeatedly is performance.
If all the uses of the set of cached IDs consist of joins of the whole set to additional tables, the solution should definitely not involve caching the set of IDs outside of the database. Data should not travel there and back again if you can avoid it.
In some cases (when cursors or extremely complex SQL are not involved) it may be best (even if counterintuitive) to perform no caching at all and simply include the repetitive SQL in every desired query. After all, each query has to be driven from one of the joined tables anyway, and performance then depends to a large degree on the availability of the indexes needed to join and evaluate all the remaining information quickly.
The most intuitive approach to "caching" the set of IDs within the database is a temporary table (if named #something, it is private to the connection and therefore usable by parallel, independent clients; it can instead be named ##something to make it global). If the table is going to have many records, indexes are necessary. For optimum performance, the index should be a clustered index (only one per table is allowed), or it should be created only after the set has been populated, when index creation is slightly faster.
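A minimal sketch of that approach, with made-up names, assuming SQL Server:
-- private to this connection because of the single #
CREATE TABLE #MatchingIds (Id INT NOT NULL);
INSERT INTO #MatchingIds (Id)
SELECT o.Id FROM Orders o WHERE o.Status = 'Open'; -- stand-in for the complex query
-- index the cached set only after it has been populated
CREATE CLUSTERED INDEX IX_MatchingIds ON #MatchingIds (Id);
-- reuse the cached set in each of the follow-up queries
SELECT d.*
FROM #MatchingIds m
INNER JOIN OrderDetails d ON d.OrderId = m.Id;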
Indexed views are clearly preferable to temporary tables except when the underlying data is read only during the whole process, or when you can and want to ignore such updates to keep the whole set of reports consistent as far as the set goes. However, the ability of indexed views to always accurately project the underlying data comes at the cost of slowing down those updates.
One other answer to this question mentions stored procedures. This is largely a way of organizing your code. However, if you go this way, it is preferable to avoid using temporary tables, because references to a temporary table prevent pre-compilation of the stored procedure; go for views or indexed views if you can.
Regardless of the approach you choose, do not guess at the performance characteristics and query optimizer behavior. Learn to display query execution plans (within SQL Server Management Studio) and make sure that you see index accesses as opposed to nested loops combining multiple large sets of data; only add indexes that demonstrably and drastically change the performance of your queries. A well chosen index can often change the performance of a query by a factor of 1000, so this is somewhat complex to learn but crucial for success.
And last but not least, make sure you use UPDATE STATISTICS when repopulating the database (and nightly in production), or your query optimizer will not be able to put the indexes you have created to their best uses.
If you are planning to cache the result set in your application code, then ASP.NET has a cache; a WinForms application can simply hold the data in an object and reuse it.
If planning to do the same in SQL Server, you might consider using indexed views to find out the Id's. The view will be materialized and hence you can get the results faster. You might even consider using a staging table to hold the id's temporarily.
With SQL Server 2008 you can pass table variables as params to SQL. Just cache the IDs and then pass them as a table variable to the queries that fetch the data. The only caveat of this approach is that you have to predefine the table type as UDT.
http://msdn.microsoft.com/en-us/library/bb510489.aspx
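A minimal sketch of the table-valued parameter approach (made-up names, SQL Server 2008 or later):
-- the user-defined table type has to exist up front
CREATE TYPE IdList AS TABLE (Id INT NOT NULL PRIMARY KEY);
GO
CREATE PROCEDURE GetOrderDetails
    @Ids IdList READONLY
AS
BEGIN
    SELECT d.*
    FROM @Ids AS i
    INNER JOIN OrderDetails d ON d.OrderId = i.Id;
END;
GO
From .NET you would fill a DataTable with the cached Ids and pass it as a parameter with SqlDbType.Structured.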
For SQL Server, Microsoft generally recommends using stored procedures whenever practical.
Here are a few of the advantages:
http://blog.sqlauthority.com/2007/04/13/sql-server-stored-procedures-advantages-and-best-advantage/
* Execution plan retention and reuse
* Query auto-parameterization
* Encapsulation of business rules and policies
* Application modularization
* Sharing of application logic between applications
* Access to database objects that is both secure and uniform
* Consistent, safe data modification
* Network bandwidth conservation
* Support for automatic execution at system start-up
* Enhanced hardware and software capabilities
* Improved security
* Reduced development cost and increased reliability
* Centralized security, administration, and maintenance for common routines
It's also worth noting that, unlike other RDBMS vendors (like Oracle, for example), MSSQL automatically caches all execution plans:
http://msdn.microsoft.com/en-us/library/ms973918.aspx
However, for the last couple of versions of SQL Server, execution plans are cached for all T-SQL batches, regardless of whether or not they are in a stored procedure.
The best approach depends on how often the Id changes, or how often you want to look it up again.
One technique is to simply store the result in the ASP.NET object cache, using the Cache object (also accessible from HttpRuntime.Cache). For example (from a page):
this.Cache["key"] = "value";
There are many possible variations on this theme.
You can use Memcached to cache values in memory.
As far as I can see, there are some .NET ports.
How frequently does the data that you'll be querying change? To me, this sounds like a perfect scenario for data warehousing, where you flatten the data for quicker retrieval and create the tables exactly as your 'DTO' wants to see the data. This method differs from an indexed view in that it's simply a table with quick seek operations, and it can be improved further if you set up the indexes properly on the columns you plan to query.
You can create a global temporary table on the fly, insert the records for your request, and then access that table in the joins of your next request, for reusability.
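A minimal sketch, with made-up names (the double ## makes the table visible to other sessions, unlike a #local temporary table):
CREATE TABLE ##CachedIds (Id INT NOT NULL PRIMARY KEY);
INSERT INTO ##CachedIds (Id)
SELECT Id FROM Orders WHERE Status = 'Open'; -- stand-in for the complex query
-- later requests join against the cached set
SELECT d.* FROM ##CachedIds c
INNER JOIN OrderDetails d ON d.OrderId = c.Id;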

Using Lucene.Net as a primary lookup for lists before heading to the database, is this a good idea?

First, I do not want to use Lucene as a database, per se, but rather as the primary look-up for displaying lists to the user. This would be a canned search against Lucene where we would pull, say, all user information to be displayed in a grid list. We are building an ASP.NET web application. Is it a good idea to pull, from Lucene initially, a list of items (that can be paged) to display to the user in some sort of grid format? The only time we would call the database is when a user selects a specific record to view or update.
My concern is stale data coming from Lucene. I have been looking for information about adds and updates to an index, but it is unclear to me whether my scenario is better suited to a database than to Lucene. My other developers and I have been going back and forth about this, but unfortunately, we don't know enough about how Lucene handles writes and reads.
I'm not sure if it's a good or bad fit for your use case. Hopefully I can give you some insight on how Lucene stores its data, and you can make a decision from that.
Lucene is extremely quick if you want to search for an item in its index. The time it takes to index items isn't as quick. It's by no means slow if you look at everything it's doing, but it adds complexity that you need to know how to deal with.
Lucene is essentially a document store. Each item in Lucene is a Document, which can hold a certain number of fields. Those fields are essentially key-value pairs, though right now Lucene only supports string and byte[] as value types, and strings only as keys. Each field can be indexed and/or analyzed (or neither). Indexing simply means you can search against that field's data, generally only via exact matches and wildcards. Analyzing gives you better searching capabilities, since it will take the string and tokenize it. Depending on the analyzer, it will tokenize it differently. The most common approach is whitespace tokenization plus stopwords, essentially marking each word as a term unless it's a stopword (a, an, the, as, etc.).
The real killer for many use cases: you can't update a document in an index in place. When you pull out a document to update it and change a field, the call to UpdateDocument() actually marks the old document as deleted and inserts a new document.
Notice I said it marks it as deleted. That introduces another thing related to Lucene indexes: optimization of the index. When you write to an index, every so often a segment of the index is written to disk (it's temporarily stored in RAM for fast indexing). When you run a search on an index, Lucene needs to open all those different segments to find the terms to search against (it also has to order them in a way). This means that if you have many segments, searching can be slow. A call to Optimize() will not only merge the segments together, it will also remove any documents marked for deletion, thus lowering your index size as well.
However, optimizing your index requires around 1.5x more space while the optimization is being done, sometimes more. Fortunately, Lucene.net is transactional during an optimization, which means not only will your index not be corrupt if an optimization fails, any existing IndexReader you have open will still be able to search and read from the index when you're optimizing it.
In short, if it were me: if you are expecting to get only one result from each search, I may not recommend Lucene. Lucene especially shines when you're searching through many documents for many documents. It's an inverted index and it's good at that. For a single lookup, you may be better off with a database. Unfortunately, the only way you'll really find out is to benchmark it. Fortunately, at least Lucene.Net is very easy to set up for something like that.
Also, if you do use Lucene.Net, consider our 2.9.4g branch. You may not be able to use it, since it is technically not release code, but it is a bit faster than normal lucene, as we've added generics and removed a bit of the costly boxing done in previous versions.
Lucene is not a good fit for the scenario you're describing. You're looking at caching data.
Why not use the Asp.net cache? If you need a more robust caching solution, there's memcached and a whole host of other ones ... even NoSql stores like mongo, redis, etc.
Obviously, you'll need to manually remove items from the cache on updates to stop serving stale data.
I think this is a viable solution, and I say this because there is a major open source content management system that is using a technique very similar to what you've described. It's called Umbraco, and its version 5 is going to use a customized version of Lucene.NET as a sort of cache.
You can look at the project and source here: http://umbraco.codeplex.com/SourceControl/changeset/view/5a7c9af9bbf9

Inner join across multiple access db's

I am re-designing an application for an ASP.NET CMS that I really don't like. I have made some improvements in performance, only to discover that not only does this CMS use MS SQL, but some users "simply" use an MS Access database.
The problem is that some of the tables I inner join are, in the MS Access version, in two different files. I am not allowed to simply move the tables into the other mdb file.
I am now trying to figure out a good way to "inner join" across multiple Access db files.
It would really be a pity if I had to fetch all the data and then do the join programmatically!
Thanks
You don't need linked tables at all. There are two approaches to using data from different MDBs that can be used without a linked table. The first is to use "IN 'c:\MyDBs\Access.mdb'" in the FROM clause of your SQL. One of your saved queries would be like:
SELECT MyTable.*
FROM MyTable IN 'c:\MyDBs\Access.mdb'
and the other saved query would be:
SELECT OtherTable.*
FROM OtherTable IN 'c:\MyDBs\Other.mdb'
You could then save those queries, and then use the saved queries to join the two tables.
Alternatively, you can manage it all in a single SQL statement by specifying the path to the source MDB for each table in the FROM clause thus:
SELECT MyTable.ID, OtherTable.OtherField
FROM [c:\MyDBs\Access.mdb].MyTable
INNER JOIN [c:\MyDBs\Other.mdb].OtherTable ON MyTable.ID = OtherTable.ID
Keep one thing in mind, though:
The Jet query optimizer won't necessarily be able to use the indexes from these tables for the join (whether it will use them for criteria on individual fields is another question), so this could be extremely slow (in my tests, it's not, but I'm not using big datasets to test). But that performance issue applies to linked tables, too.
If you have access to the MDBs, and are able to change them, you might consider using Linked Tables. Access provides the ability to link to external data (in other MDBs, in Excel files, even in SQL Server or Oracle), and then you can perform your joins against the links.
I'd strongly encourage performance testing such an option. If it's feasible to migrate users of the Access databases to another system (even SQL Express), that would also be preferable -- last I checked, there are no 64-bit JET drivers for ODBC anymore, so if the app is ever hosted in a 64-bit environment, these users will be hosed.
Inside one access DB you can create "linked tables" that point to the other DB. You should (I think) be able to query the tables as if they both existed in the same DB.
It does mean you have to change one of the DBs to create the virtual table, but at least you're not actually moving the data, just making a pointer to it.
Within Access, you can add remote tables through the "Linked Table Manager". You could add the links to one Access file or the other, or you could create a new Access file that references the tables in both files. After this is done, the inner-join queries are no different than doing them in a single database.
