Does SQLite 'merge' prefixes used by LIKE?

I have a query like
select * from tbl where name like 'a%' or name like 'abc%';
Does SQLite search 'a%' and 'abc%' separately? Would it check that 'abc%' is included in 'a%', and do only one search?
"explain query plan" returns
"0" "0" "0" "SEARCH TABLE traces USING PRIMARY KEY (name>? AND name<?)"
"0" "0" "0" "SEARCH TABLE traces USING PRIMARY KEY (name>? AND name<?)"
Is that what happens at run time?

Does SQLite search 'a%' and 'abc%' separately? Would it check that 'abc%' is
included in 'a%', and do only one search?
I think neither is quite the correct answer, as what actually happens appears to be in between the two options given.
I think trawling through the documentation will explain a little.
First port of call is The SQLite Query Optimizer Overview, which says:
If the WHERE clause is composed of constraints separated by the OR
operator then the entire clause is considered to be a single "term" to
which the OR-clause optimization is applied.
Additionally, in EXPLAIN QUERY PLAN it states:
If the WHERE clause of a query contains an OR expression, then SQLite
might use the "OR by union" strategy (also described here).
The link referred to ("here") is 1.8. OR-Connected Terms In The WHERE Clause, which is quoted further below.
In this
case there will be two SEARCH records, one for each index, with the
same values in both the "order" and "from" columns. For example:
sqlite> CREATE INDEX i3 ON t1(b);
sqlite> EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE a=1 OR b=2;
0|0|0|SEARCH TABLE t1 USING COVERING INDEX i2 (a=?)
0|0|0|SEARCH TABLE t1 USING INDEX i3 (b=?)
So it very much appears that the "OR by union" strategy is being used, since you have:
"0" "0" "0" "SEARCH TABLE traces USING PRIMARY KEY (name>? AND name<?)"
"0" "0" "0" "SEARCH TABLE traces USING PRIMARY KEY (name>? AND name<?)"
The OR-clause optimization is explained in 3.0 OR optimizations (in the same first document). However, that section contains a lot of "might"s; I think the link provided in EXPLAIN QUERY PLAN to 1.8. OR-Connected Terms In The WHERE Clause is more pertinent. It includes:
1.8. OR-Connected Terms In The WHERE Clause
Multi-column indices only work if the constraint terms in the WHERE
clause of the query are connected by AND. So Idx3 and Idx4 are helpful
when the search is for items that are both Oranges and grown in
California, but neither index would be that useful if we wanted all
items that were either oranges or are grown in California.
SELECT price FROM FruitsForSale WHERE fruit='Orange' OR state='CA';
When confronted with OR-connected terms in a WHERE clause, SQLite
examines each OR term separately and tries to use an index to find the
rowids associated with each term. It then takes the union of the
resulting rowid sets to find the end result. The following figure
illustrates this process (figure not reproduced here):
The diagram above implies that SQLite computes all of the rowids
first and then combines them with a union operation before starting to
do rowid lookups on the original table. In reality, the rowid lookups
are interspersed with rowid computations. SQLite uses one index at a
time to find rowids while remembering which rowids it has seen before
so as to avoid duplicates. That is just an implementation detail,
though. The diagram, while not 100% accurate, provides a good overview
of what is happening.
In order for the OR-by-UNION technique shown above to be useful, there
must be an index available that helps resolve every OR-connected term
in the WHERE clause. If even a single OR-connected term is not
indexed, then a full table scan would have to be done in order to find
the rowids generated by the one term, and if SQLite has to do a full
table scan, it might as well do it on the original table and get all
of the results in a single pass without having to mess with union
operations and follow-on binary searches.
One can see how the OR-by-UNION technique could also be leveraged to
use multiple indices on queries where the WHERE clause has terms
connected by AND, by using an intersect operator in place of union.
Many SQL database engines will do just that. But the performance gain
over using just a single index is slight and so SQLite does not
implement that technique at this time. However, a future version of
SQLite might be enhanced to support AND-by-INTERSECT.
Another consideration is 4.0 The LIKE optimization; however, I believe that is applied on a per-LIKE-clause basis only.
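To make that concrete, here is a minimal sketch (the schema and the case_sensitive_like setting are illustrative assumptions, not taken from the question, and the exact EXPLAIN QUERY PLAN formatting varies between SQLite versions):
-- A text primary key so that LIKE prefix searches can use the index.
CREATE TABLE traces (name TEXT PRIMARY KEY, payload TEXT) WITHOUT ROWID;

-- The LIKE optimization (section 4.0) only rewrites LIKE into range
-- constraints when the comparison is case sensitive for the column.
PRAGMA case_sensitive_like = ON;

EXPLAIN QUERY PLAN
SELECT * FROM traces WHERE name LIKE 'a%' OR name LIKE 'abc%';
-- Expected shape: one SEARCH per OR term, each LIKE rewritten as a range
-- (name >= 'a' AND name < 'b', and name >= 'abc' AND name < 'abd'). The
-- narrower 'abc%' range is not folded into the wider 'a%' range, but
-- duplicate rows are skipped at run time:
--   SEARCH TABLE traces USING PRIMARY KEY (name>? AND name<?)
--   SEARCH TABLE traces USING PRIMARY KEY (name>? AND name<?)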

Related

Limit fulltext search in MariaDB (innodb)

I'm having trouble making a search on a fairly large (5 million entries) table fast.
This is InnoDB on MariaDB (10.4.25).
Structure of the table my_table is like so:
id | text
1  | some text
2  | some more text
I now have a fulltext index on "text" and search for:
SELECT id FROM my_table WHERE MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
This is not super slow but can yield millions of results. Retrieving them in my Java application takes forever, but I need the matching ids.
Therefore, I wanted to limit the number already by the ids I know can only be relevant and tried something like this (id is primary index):
SELECT id FROM my_table WHERE id IN (1,2) AND MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
hoping that it would first limit to the 2 ids and then apply the fulltext search and give me the two results super quickly. Alas, that's not what happened and I don't understand why.
How can I limit the query if I already know some ids to only search through those AND make the query faster by doing so?
When you use a FULLTEXT (or SPATIAL) index together with some 'regular' index, the Optimizer assumes that the former will run faster, so it does that first.
Furthermore, it is nontrivial (maybe impossible) to run MATCH against a subset of a table.
Both of those conspire to say that the MATCH will happen first. (Of course, you were hoping to do the opposite.)
Is there a workaround? I doubt it. Especially if there are a lot of rows with words starting with 'some' or 'tex'.
One thing to try is "+":
MATCH ('text') AGAINST ("+some* +tex*" IN BOOLEAN MODE);
Please report back whether this helped.
Hmmmm... Perhaps you want
MATCH (`text`) -- this
MATCH ('text') -- NOT this
There are also two limiting features in MariaDB that may help:
the maximum time spent in a query
the maximum number of rows accessed (may not apply to FULLTEXT)
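Putting the two earlier suggestions together (backticks around the column name and "+" to require both words), the query would look something like the sketch below; whether it actually helps is something only testing on your data will show.
SELECT id
FROM my_table
WHERE MATCH (`text`) AGAINST ('+some* +tex*' IN BOOLEAN MODE);

-- Hedged hints for the two limiting features (names/syntax from memory,
-- verify against the MariaDB docs for your version):
--   SET STATEMENT max_statement_time=10 FOR SELECT ...;   -- cap query time (seconds)
--   SELECT ... LIMIT 100 ROWS EXAMINED 1000000;           -- cap rows accessed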

What is the point of Snowflake's Unique constraint?

Snowflake offers a Unique constraint but doesn't actually enforce it. I have an example below showing that with a test table.
What is the point, what value does the constraint add?
What workarounds do people use to avoid duplicates? I can perform a query before every insert but it seems like unnecessary usage.
CREATE OR REPLACE TABLE dbo.Test
(
"A" INT NOT NULL UNIQUE,
"B" STRING NOT NULL
);
INSERT INTO dbo.Test
VALUES (0, 'ABC');
INSERT INTO dbo.Test
VALUES (0, 'DEF');
SELECT *
FROM dbo.Test;
A | B
0 | ABC
0 | DEF
For one, Snowflake is not alone in this world: data gets imported and exported, and while Snowflake does not enforce the constraints, other systems might, so this way they won't get lost while travelling through Snowflake.
For another, the constraints are also informational for data analytics tools, as already mentioned in the link Kirby provided.
Please remember that a check followed by an insert is not atomic, so running a check before every insert can still give you duplicates at high concurrency. To avoid duplicates fully you need to either run merges (which is admittedly going to be slower, see the sketch below) or manually delete the "excess" data after it has been loaded.
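A rough sketch of the merge approach, assuming new rows arrive in a staging table (dbo.Test_Stage is a made-up name) and that "A" is the deduplication key:
MERGE INTO dbo.Test AS t
USING dbo.Test_Stage AS s          -- assumes the stage itself has one row per "A"
    ON t."A" = s."A"
WHEN MATCHED THEN
    UPDATE SET t."B" = s."B"
WHEN NOT MATCHED THEN
    INSERT ("A", "B") VALUES (s."A", s."B");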

SQLite: re-arrange physical position of rows inside file

My problem is that my queries are too slow.
I have a fairly large sqlite database. The table is:
CREATE TABLE results (
    timestamp TEXT,
    name TEXT,
    result FLOAT
);
(I know that timestamps as TEXT is not optimal, but please ignore that for the purposes of this question. I'll have to fix that when I have the time)
"name" is a category. This calculation holds the results of a calculation that has to be done at each timestamp for all "name"s. So the inserts are done at equal-timestamps, but the querys will be done at equal-names (i.e. I want given a name, get its time series), like:
SELECT timestamp,result WHERE name='some_name';
Now, the way I'm doing things now is to have no indexes, calculate all results, then create an index on name CREATE INDEX index_name ON results (name). The reasoning is that I don't need the index when I'm inserting, but having the index will make querys on the index really fast.
But it's not. The database is fairly large. It has about half a million timestamps, and for each timestamp I have about 1000 names.
I suspect, although I'm not sure, that the reason why it's slow is that even though I've indexed the names, they're still scattered all around the physical disk. Something like:
timestamp1,name1,result
timestamp1,name2,result
timestamp1,name3,result
...
timestamp1,name999,result
timestamp1,name1000,result
timestamp2,name1,result
timestamp2,name2,result
etc...
I'm sure this is slower to query with NAME='some_name' than if the rows were physically ordered as:
timestamp1,name1,result
timestamp2,name1,result
timestamp3,name1,result
...
timestamp499997,name1000,result
timestamp499998,name1000,result
timestamp499999,name1000,result
timestamp500000,name1000,result
etc...
So, how do I tell SQLite that the order in which I'd like the rows on disk isn't the one they were written in?
UPDATE: I'm further convinced that the slowness in doing a select with such an index comes exclusively from non-contiguous disk access. Doing SELECT * FROM results WHERE name=<something_that_doesnt_exist> immediately returns zero results. This suggests that it's not finding the names that's slow, it's actually reading them from the disk.
Normal sqlite tables have, as a primary key, a 64-bit integer (known as rowid and a few other aliases). That determines the order that rows are stored in a B*-tree (which puts all actual data in leaf node pages). You can change this with a WITHOUT ROWID table, but that requires an explicit primary key, which is used to place rows in a B-tree. So if every row's (name, timestamp) columns make a unique value, that's a possibility that will leave all rows with the same name on a smaller set of pages instead of scattered all over.
You'd want the composite PK to be in that order if you're searching for a particular name most of the time, so something like:
CREATE TABLE results (
    timestamp TEXT
  , name TEXT
  , result REAL
  , PRIMARY KEY (name, timestamp)
) WITHOUT ROWID;
(And of course not bothering with a second index on name.) The tradeoff is that inserts are likely to be slower as the chances of needing to split a page in the B-tree go up.
Some pragmas worth looking into to tune things (a short sketch follows below):
cache_size
mmap_size
optimize (After creating your index; also consider building sqlite with SQLITE_ENABLE_STAT4.)
Since you don't have an INTEGER PRIMARY KEY, consider VACUUM after deleting a lot of rows if you ever do that.
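A minimal sketch of how those pragmas might be applied; the values are placeholders to tune for your machine, not recommendations:
PRAGMA cache_size = -200000;      -- negative = size in KiB, so roughly 200 MB of page cache
PRAGMA mmap_size  = 1073741824;   -- allow up to 1 GiB of memory-mapped I/O
-- after the index has been built (and periodically thereafter):
PRAGMA optimize;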

Is SQLite "Insert or Replace" slower than just "Insert"?

I am copying millions of rows to a table in another database. I am doing a few things with the data in between and have duplicates on a certain column that is used as a key in the destination table. Ignoring all the other solutions to fix this, I am testing out using "Insert or Replace" and so far processing is going smoothly, but I am not sure whether this is faster than a normal "Insert" (given a case where there are no PK duplicates)?
The OR REPLACE clause works only if there is some UNIQUE (or PRIMARY KEY) constraint that could be violated.
This means that the database always has to check whether there is a duplicate, the only difference is what happens when a duplicate is found: report an error, or delete the old row.
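A small sketch of that behaviour (table and values made up for illustration):
CREATE TABLE dest (key TEXT PRIMARY KEY, value TEXT);

INSERT OR REPLACE INTO dest (key, value) VALUES ('a', 'first');
INSERT OR REPLACE INTO dest (key, value) VALUES ('a', 'second');
-- Both statements perform the PRIMARY KEY lookup; the second finds a duplicate,
-- deletes the old row and inserts the new one, leaving just ('a', 'second').
-- Without any UNIQUE or PRIMARY KEY constraint, OR REPLACE behaves like a plain INSERT.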

Hierarchical Database Select / Insert Statement (SQL Server)

I have recently stumbled upon a problem with selecting relationship details from one table and inserting them into another table; I hope someone can help.
I have a table structure as follows:
ID (PK) | Name      | ParentID
1       | Myname    | 0
2       | nametwo   | 1
3       | namethree | 2
This is the table I need to select from to get all the relationship data, and there could be an unlimited number of sub-links (is there a function I can create to do this looping?).
Then, once I have all the data, I need to insert it into another table, and the IDs will have to change, as the IDs must go in order (e.g. I cannot have ID 2 be a sub of ID 3). I am hoping I can use the same function used for selecting to do the inserting.
If you are using SQL Server 2005 or above, you may use recursive queries to get your information. Here is an example:
With tree (id, Name, ParentID, [level])
As (
    Select id, Name, ParentID, 1
    From [myTable]
    Where ParentID = 0
    Union All
    Select child.id
         , child.Name
         , child.ParentID
         , parent.[level] + 1 As [level]
    From [myTable] As [child]
    Inner Join [tree] As [parent]
        On [child].ParentID = [parent].id
)
Select * From [tree];
This query will return the row requested by the first portion (Where ParentID = 0) and all sub-rows recursively. Does this help you?
I'm not sure I understand what you want to have happen with your insert. Can you provide more information in terms of the expected result when you are done?
Good luck!
For the retrieval part, you can take a look at Common Table Expressions. This feature provides recursive operation using SQL.
For the insertion part, you can use the CTE above to regenerate the IDs and insert accordingly; a rough sketch follows.
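A rough sketch of that idea, assuming a destination table [myNewTable] with the same columns (the table name and the renumbering rule are illustrative; note that the ParentID values would still reference the old IDs and need remapping in a second step, which is omitted here):
With tree (id, Name, ParentID, [level])
As (
    Select id, Name, ParentID, 1
    From [myTable]
    Where ParentID = 0
    Union All
    Select child.id, child.Name, child.ParentID, parent.[level] + 1
    From [myTable] As [child]
    Inner Join [tree] As [parent]
        On [child].ParentID = [parent].id
)
Insert Into [myNewTable] (id, Name, ParentID)
Select Row_Number() Over (Order By [level], id)   -- new id: parents always numbered before children
     , Name
     , ParentID                                   -- still the old parent id; remap in a second pass
From [tree];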
I hope this link helps: Self-Joins in SQL
This is the problem of finding the transitive closure of a graph in sql. SQL does not support this directly, which leaves you with three common strategies:
use a vendor specific SQL extension
store the Materialized Path from the root to the given node in each row
store the Nested Sets, that is the interval covered by the subtree rooted at a given node when nodes are labeled depth first
The first option is straightforward, and if you don't need database portability is probably the best. The second and third options have the advantage of being plain SQL, but require maintaining some de-normalized state. Updating a table that uses materialized paths is simple, but for fast queries your database must support indexes for prefix queries on string values. Nested sets avoid needing any string indexing features, but can require updating a lot of rows as you insert or remove nodes.
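For the materialized-path option, the idea looks roughly like this (table and column names are made up, SQL Server syntax assumed):
-- Each row stores the full chain of ancestor ids from the root, e.g. '1/2/3/'.
Create Table nodes (
    id   Int Primary Key
  , Name Varchar(100)
  , path Varchar(900)    -- '1/', '1/2/', '1/2/3/', ...
);
Create Index ix_nodes_path On nodes (path);

-- All descendants of node 2 (whose path is '1/2/'): a prefix query the index can serve.
Select id, Name
From nodes
Where path Like '1/2/%';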
If you're fine with always using MSSQL, I'd use the vendor specific option Adrian mentioned.
