Storing tri-grams in database or generate on-the-fly? - string-matching

I'm trying to create an application that uses trigrams for approximate string matching. All the records are in the database, and I want to be able to search them on a fixed column. Is it best to have an additional field which contains the hashed version of the value I want to search (and if so, what's the best way to store it?), or is it better to generate the trigrams on the fly?

Which database are you using?
PostgreSQL provides trigram functions through its pg_trgm module, and they can work off GiST or GIN indexes.
In SQL Server, I'm using CLR to create and compare trigram sets; it works much, much faster than plain SQL code.
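For comparison, here is a minimal sketch of the "generate on the fly" option in Python. The padding (two leading spaces, one trailing space), the lowercasing, and the Jaccard-style score are assumptions chosen for illustration, and the function names are hypothetical. A stored variant would write the same trigram sets to an indexed side table instead of recomputing them on every search.

```python
def trigrams(text: str) -> set:
    """Return the set of character trigrams of a lowercased, padded string."""
    # Assumption: pad with two leading spaces and one trailing space so that
    # the start and end of the string contribute their own trigrams.
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def best_matches(query: str, candidates, threshold: float = 0.3):
    """Return (score, candidate) pairs at or above the threshold, best first."""
    scored = [(similarity(query, c), c) for c in candidates]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```

Recomputing trigrams for every row on every search scans the whole table, so this approach only suits small data sets; beyond that, a trigram index in the database is the more usual route.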

Related

Most efficient way to change synthetic partition key values

I have a collection with thousands of documents all of which have a synthetic partition key property like:
partitionKey: ‘some-document-related-value’
Now I need to change the values of partitionKey. Of course, that requires recreating the documents, but I am wondering what the most efficient and straightforward way to do it is.
Should I use an Azure Function with a CosmosDBTrigger (set to start the feed from the beginning)?
A change feed processor?
Some other way?
I'm looking for the quickest solution that's still reliable.
Yes, change feed is a common way to migrate data from one container to another. Another simple option may be to use Data Migration Tool where you build your new partition key in the select statement.
Hopefully this is helpful.

SQL Performance from .net: Insert via loop vs XML vs Sql Data Table Type

I have a handful of records, 5-10, that I need to take from the user and run a SQL merge statement against. I can think of three ways of accomplishing this.
.NET loop processing one record at a time - I wonder how this would perform compared to the other options. I would think it is pretty good given connection pooling?
SQL Data Table type - I have seen these used elsewhere in the project, but as I learned first hand, it is a pain to update their definitions when needed: the entire object has to be dropped and recreated.
XML variable - I have used this in the past. I like it because it makes the definition of the object easy to change. The .NET side is simple with XMLSerializer, but I am sure there is a performance hit for calling XMLSerializer, and then again on the SQL side for using the .nodes() function.
Does anyone know by personal experience or some reference, such as a white paper, which method is the most efficient when inserting/updating records in a database via .net application?
For 5-10 items you can use a "classic" INSERT with multiple rows.
INSERT INTO MyTable
(ColumnA, ColumnB, ColumnC)
VALUES
(@ColumnA_0, @ColumnB_0, @ColumnC_0),
(@ColumnA_1, @ColumnB_1, @ColumnC_1),
(@ColumnA_2, @ColumnB_2, @ColumnC_2)
This is MUCH faster than XML or a DataTable, and it is also faster than separate inserts in a loop.
A single INSERT ... VALUES statement can insert at most 1000 rows. If you need more, you have to execute several statements.
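The multi-row statement above can be generated and batched programmatically. The following is a minimal sketch in Python rather than .NET, run against an in-memory SQLite database purely so that it is self-contained. The helper names are hypothetical, positional `?` placeholders stand in for the named parameters, and the 1000-row cap is taken from the answer above.

```python
import sqlite3

MAX_ROWS_PER_STATEMENT = 1000  # per-statement row limit quoted in the answer


def build_insert(table, columns, row_count):
    """Build one INSERT with `row_count` parameterized rows.

    Assumption: `table` and `columns` are trusted constants, never user input.
    """
    row = "(" + ", ".join("?" for _ in columns) + ")"
    rows = ", ".join([row] * row_count)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {rows}"


def insert_rows(conn, table, columns, rows, max_rows=MAX_ROWS_PER_STATEMENT):
    """Insert all rows, issuing one statement per batch of `max_rows`."""
    for start in range(0, len(rows), max_rows):
        batch = rows[start:start + max_rows]
        sql = build_insert(table, columns, len(batch))
        params = [value for row in batch for value in row]  # flatten the batch
        conn.execute(sql, params)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ColumnA TEXT, ColumnB TEXT, ColumnC TEXT)")
data = [(f"a{i}", f"b{i}", f"c{i}") for i in range(7)]
insert_rows(conn, "MyTable", ["ColumnA", "ColumnB", "ColumnC"], data, max_rows=3)
```

Databases also cap the number of parameters per statement, so for wide tables the batch size may need to be smaller than the row limit.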

SQLite fulltext virtual table normally usable?

Even after reading a lot about SQLite's full-text index, I still have a question that I haven't seen answered anywhere:
I already have a table that I want to search with the fulltext index. I would just create an extra virtual table USING FTS3 or USING FTS4 and then INSERT my data into it.
Does that then use double the storage in total? Can I use such a virtual table just like a normal table and thus avoid storing the data twice?
(I am working with SQLite on Android but this question may apply to usage on any SQLite compatible platform.)
Although you have already found some details, I'll try to give a detailed answer:
1. Does that then use double the storage in total?
Yes, it does. It may use even more space. For example, for the widely known Enron E-Mail Dataset and the FTS3 example, compare:
The FTS3 table consumes around 2006 MB on disk compared to just 1453 MB for the ordinary table
The FTS3 table took just under 31 minutes to populate, versus 25 for the ordinary table
That makes the situation a bit unpleasant, but full-text search is still worth it.
2. Can I use such a virtual table just like a normal table?
The short answer is no, you can't. A virtual table is best treated as a kind of view with several limitations. You have noticed several already.
Generally speaking, you should not use any feature that seems unnatural for a view. Use only the bare minimum required to let your application fully utilize the power of full-text search. That way there will be no surprises later with newer versions of the module.
There is no magic behind this solution; it is just a trade-off between performance, required disk space, and functionality.
Final conclusion
I would highly recommend using FTS4, because it is faster and its only drawback is the additional storage space needed.
In any case, you have to design the virtual table carefully, taking into account the supplementary and highly specialized nature of such a solution. In other words, do not try to replace your initial table with the virtual one. Use both, with great care.
Update
I would recommend looking through the following article: iOS full-text search with Core Data and SQLite. A few interesting points:
The virtual table is created in the same SQLite database in which the Core Data content resides. To keep this table as light as possible only object properties relevant to the search query are inserted.
SQLite implementation offers something Core Data does not: full-text search. Next to that, it performs almost 10% faster and at least 660% more (memory) efficiently than a comparable Core Data query.
I just found out the main differences between virtual tables and normal tables, and whether a single table suffices seems to depend on your usage.
One cannot create a trigger on a virtual table.
One cannot create additional indices on a virtual table. (Virtual tables can have indices but that must be built into the virtual table implementation. Indices cannot be added separately using CREATE INDEX statements.)
One cannot run ALTER TABLE ... ADD COLUMN commands against a virtual table.
So if you need another index on the table, you need to use two tables.
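A minimal sketch of the two-table arrangement follows, using Python's sqlite3 module. It assumes the underlying SQLite library was compiled with FTS4 support, and the table, column, and function names are hypothetical. The normal table keeps every column plus an ordinary index, while the virtual table holds only the searchable text and is joined back through docid.

```python
import sqlite3


def open_db():
    """Create a normal table, an ordinary index on it, and an FTS4 table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE notes (
            id INTEGER PRIMARY KEY, title TEXT, body TEXT, created TEXT
        );
        -- Ordinary indexes are only possible on the normal table.
        CREATE INDEX idx_notes_created ON notes(created);
        -- Only the searchable text goes into the virtual table.
        CREATE VIRTUAL TABLE notes_fts USING fts4(title, body);
    """)
    return conn


def add_note(conn, title, body, created):
    """Insert into both tables, reusing the row id as the FTS docid."""
    cur = conn.execute(
        "INSERT INTO notes (title, body, created) VALUES (?, ?, ?)",
        (title, body, created),
    )
    conn.execute(
        "INSERT INTO notes_fts (docid, title, body) VALUES (?, ?, ?)",
        (cur.lastrowid, title, body),
    )
    return cur.lastrowid


def search(conn, query):
    """Full-text match in the virtual table, then join to the real rows."""
    return conn.execute(
        "SELECT notes.id, notes.title FROM notes_fts "
        "JOIN notes ON notes.id = notes_fts.docid "
        "WHERE notes_fts MATCH ? ORDER BY notes.created",
        (query,),
    ).fetchall()
```

Calling `search(conn, "trigram")` returns the id and title of every note whose title or body contains that token. The text is stored twice, which is the storage cost the answers above describe.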

What is the best way to CRUD dynamically created tables?

I'm creating (in fact, have created) an ASP.NET application (SQL Server backend) that allows the user (a business) to create their own tables and fields. They will all be child tables of a parent table (non-dynamic) and have proper PK/FK relationships (default fields when the table is created).
However, I don't like my current method of updating/inserting and selecting the fields. I was going to create an SP that was passed the proper keys and table names, then have it return the proper SQL statement. I'm thinking that it might make more sense to just pass the name/value pairs of fields/values and have an SP actually process them. Is this the best way to do it? If so, I'm not good at SPs, so could anyone give an example of how to do it?
I don't have a lot of experience with the EAV model, but it does sound like it might be a good idea for implementing what you're trying to achieve. However, if you already have a system in place, an overhaul could be very expensive.
If the queries you're making against the user tables are basic CRUD operations, what about just creating CRUD stored procs for each table? E.g. -
Table:
acme_orders
Stored Procs:
acme_orders_insert
acme_orders_update
acme_orders_select
acme_orders_delete
... [other necessary procs]
I have no idea what the business needs are for these tables, but I imagine that whatever you're doing currently could be translated into doing the same thing with stored procs.
I was going to create an SP that was passed the proper keys and table names, then have it return the proper SQL statement. I'm thinking that it might make more sense to just pass the name/value pairs of fields/values and have an SP actually process them.
Assuming you mean the proc would generate and then execute the SQL (sometimes known as dynamic SQL), this can work, but it will probably be slower than the static, compiled SQL used in normal procs.
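The name/value approach can be sketched outside a stored procedure as well. The following Python function is only an illustration of the idea: it builds a parameterized UPDATE from a dictionary of field values and checks the table and column names against a known schema, since identifiers cannot be passed as parameters. `SCHEMA`, `build_update`, and the column names are hypothetical.

```python
# Hypothetical registry of the user-created tables and their columns.
SCHEMA = {
    "acme_orders": {"id", "customer", "total"},
}


def build_update(schema, table, key_column, key_value, values):
    """Return (sql, params) for an UPDATE built from name/value pairs.

    Table and column names are validated against `schema`; only the values
    travel as parameters, so they are never spliced into the SQL text.
    """
    if table not in schema:
        raise ValueError(f"unknown table: {table}")
    unknown = (set(values) | {key_column}) - schema[table]
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    if not values:
        raise ValueError("nothing to update")
    assignments = ", ".join(f"{column} = ?" for column in values)
    sql = f"UPDATE {table} SET {assignments} WHERE {key_column} = ?"
    return sql, list(values.values()) + [key_value]
```

For example, `build_update(SCHEMA, "acme_orders", "id", 1, {"customer": "Acme", "total": 9.5})` produces one statement with three placeholders. The same validation step applies if the statement is assembled inside a stored procedure instead.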

What is an index in SQLite?

I don't understand what an index is or does in SQLite. (NOT SQL) I think it allows for sorting in ascending and descending order and quicker access to data. But I'm just guessing here.
Why not SQL? The answer is the same, though the internal details will differ between implementations.
Putting an index on a column tells the database engine to build, unsurprisingly, an index that allows it to quickly locate rows when you search for certain values in a column, without having to scan every row in the table.
A simple (and probably suboptimal) index might be built with an ordinary binary search tree.
Yes, indexes are all about improved data access performance (but at the cost of storage)
http://en.wikipedia.org/wiki/Index_(database)
An index (in any database) is a list of some kind which associates a sorted (or at least, quickly searchable) list of keys with information about where to find the rest of the data associated with the key.
You may not be finding information about this on the Internet because you're assuming it's a SQLite concept, but it's not - it's a general computer engineering concept.
Think about an address book. If you are looking for the phone number of Rossi Mario, you know that surnames are ordered alphabetically, so you can go to the letter R, then look for the letter o, and so on. Indexes do the same: they are collections of references to entries that greatly speed up some operations.
Searching in an unordered address book would be much slower: you would have to start from the first name on the first page and go through all the pages until you found the name you were looking for.
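The address book analogy can be sketched in a few lines of Python. This is only an illustration with hypothetical function names: the unordered book is searched entry by entry, while the sorted book is searched by repeated halving using the standard bisect module.

```python
import bisect


def linear_search(names, target):
    """Unordered address book: check entries one by one, O(N)."""
    for position, name in enumerate(names):
        if name == target:
            return position
    return -1


def sorted_search(sorted_names, target):
    """Alphabetical address book: binary search, O(log N)."""
    position = bisect.bisect_left(sorted_names, target)
    if position < len(sorted_names) and sorted_names[position] == target:
        return position
    return -1


address_book = ["Rossi Mario", "Bianchi Anna", "Verdi Luca", "Conti Sara"]
alphabetical = sorted(address_book)  # building the "index" costs time and space
```

Sorting up front is the price paid once so that every later lookup is cheap, which mirrors the storage and maintenance cost of a database index.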
I think it allows for sorting in ascending and descending order and quicker access to data.
Yes, that's what it's for. Indexes create the abstraction of having sorted data, which speeds up searches significantly. With an index using a balanced binary search tree, searches take O(log N) instead of O(N) time.
What the other answers haven't mentioned is that most databases use indexes to implement UNIQUE (and therefore also PRIMARY KEY) constraints. To ensure uniqueness, you have to be able to detect whether the key is already there, and that means you want fast searches for it.
Take a look in your SQLite database. Those sqlite_autoindex_ indices were created to enforce UNIQUE constraints.
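Both points can be checked from Python's sqlite3 module. The sketch below uses a hypothetical `people` table: it lists the indexes recorded in sqlite_master, where the automatic one for the UNIQUE column appears, and then compares the EXPLAIN QUERY PLAN output for the same query before and after creating an index. The exact wording of the plan text differs between SQLite versions, so it is best read rather than parsed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, surname TEXT, email TEXT UNIQUE)"
)


def index_names(conn, table):
    """Names of all indexes SQLite has recorded for a table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? ORDER BY name",
        (table,),
    )
    return [row[0] for row in rows]


def query_plan(conn, sql, params=()):
    """Human-readable plan steps (the last column of EXPLAIN QUERY PLAN)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


lookup = "SELECT * FROM people WHERE surname = ?"
auto = index_names(conn, "people")            # index created for UNIQUE
before = query_plan(conn, lookup, ("Rossi",))  # full table scan expected
conn.execute("CREATE INDEX idx_people_surname ON people(surname)")
after = query_plan(conn, lookup, ("Rossi",))   # should now mention the index
```

Printing `before` and `after` shows the planner switching from scanning the table to searching it through `idx_people_surname`.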
The same as an index in any SQL (YES SQL) RDBMS.
You can see the SQLite query optimizer considers indexes: http://www.sqlite.org/optoverview.html
Speed up searching and sorting
Different types of SQLite indices speed up searching and sorting in different ways.
The following tutorial explains this in a great way.
