Reading through the SQLite documentation, I found the following function:
http://www.sqlite.org/lang_corefunc.html#likelihood
The likelihood(X,Y) function returns argument X unchanged. The value Y in likelihood(X,Y)
must be a floating point constant between 0.0 and 1.0, inclusive. The likelihood(X) function
is a no-op that the code generator optimizes away so that it consumes no CPU cycles during
run-time (that is, during calls to sqlite3_step()). The purpose of the likelihood(X,Y)
function is to provide a hint to the query planner that the argument X is a boolean that is
true with a probability of approximately Y. The unlikely(X) function is short-hand for
likelihood(X,0.0625).
Assuming that I know that x will be 1 (true) 75% of the time, how would:
select likelihood(x,.75)
help the query optimizer?
The original example was this:
Consider the following schema and query:
CREATE TABLE composer(
cid INTEGER PRIMARY KEY,
cname TEXT
);
CREATE TABLE album(
aid INTEGER PRIMARY KEY,
aname TEXT
);
CREATE TABLE track(
tid INTEGER PRIMARY KEY,
cid INTEGER REFERENCES composer,
aid INTEGER REFERENCES album,
title TEXT
);
CREATE INDEX track_i1 ON track(cid);
CREATE INDEX track_i2 ON track(aid);
SELECT DISTINCT aname
FROM album, composer, track
WHERE cname LIKE '%bach%'
AND composer.cid=track.cid
AND album.aid=track.aid;
The schema is for a (simplified) music catalog application, though similar kinds of schemas come up in other situations. There is a large number of albums. Each album contains one or more tracks. Each track has a composer. Each composer might be associated with multiple tracks.
The query asks for the name of every album that contains a track with a composer whose name matches '%bach%'.
The query planner needs to choose among several alternative algorithms for this query. The best choice hinges on how well the expression "cname LIKE '%bach%'" filters the results. Let's give this expression a "filter value" which is a number between 1.0 and 0.0. A value of 1.0 means that cname LIKE '%bach%' is true for every row in the composer table. A value of 0.0 means the expression is never true.
The current query planner (in version 3.8.0) assumes a filter value of 1.0. In other words, it assumes that the expression is always true. The planner is assuming the worst case so that it will pick a plan that minimizes worst case run-time. That's a safe approach, but it is not optimal. The plan chosen for a filter of 1.0 is track-album-composer. That means that the "track" table is in the outer loop. For each row of track, an indexed lookup occurs on album. And then an indexed lookup occurs on composer, then the LIKE expression is run to see if the album name should be output.
A better plan would be track-composer-album. This second plan avoids the album lookup if the LIKE expression is false. The current planner would choose this second algorithm if the filter value was just slightly less than 1.0. Say 0.99. In other words, if the planner thought that the LIKE expression would be false for 1 out of every 100 rows, then it would choose the second plan. That is the correct (fastest) choice for when the filter value is large.
But in the common case of a music library, the filter value is probably much closer to 0.0 than it is to 1.0. In other words, the string "bach" is unlikely to be found in most composer names. And for values near 0.0, the best plan is composer-track-album. The composer-track-album plan is to scan the composer table once looking for entries that match '%bach%' and for each matching entry use indices to look up the track and then the album. The current 3.8.0 query planner chooses this third plan when the filter value is less than about 0.1.
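To check which of these join orders the planner actually picks for a given database, the statement can be prefixed with EXPLAIN QUERY PLAN (the exact output format varies between SQLite versions):
EXPLAIN QUERY PLAN
SELECT DISTINCT aname
FROM album, composer, track
WHERE cname LIKE '%bach%'
AND composer.cid=track.cid
AND album.aid=track.aid;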
The likelihood() function gives the database a (hopefully) better estimate of the selectivity of a filter.
With the example query, it would look like this:
SELECT DISTINCT aname
FROM album, composer, track
WHERE likelihood(cname LIKE '%bach%', 0.05)
AND composer.cid=track.cid
AND album.aid=track.aid;
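And because unlikely(X) is documented as shorthand for likelihood(X, 0.0625), a filter that is expected to be true only rarely can be written roughly equivalently as:
SELECT DISTINCT aname
FROM album, composer, track
WHERE unlikely(cname LIKE '%bach%')
AND composer.cid=track.cid
AND album.aid=track.aid;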
I've been reading the DynamoDB docs and was unable to understand whether it makes sense to query on a Global Secondary Index with the 'contains' operator.
My problem is as follows: my DynamoDB document has a list of embedded objects, and every object has a 'code' field which is unique:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
]
}
I want to be able to get all documents that contain entities with entity.code = X.
For this purpose I'm considering adding a Global Secondary Index that would contain all entity.codes that are present in the current DB document, separated by commas. So the example above would look like:
{
"entities":[
{"code":"entity1Code", "name":"entity1Name"},
{"code":"entity2Code", "name":"entity2Name"}
],
"entitiesGlobalSecondaryIndex":"entityCode1,entityCode2"
}
And then I would like to apply a filter expression on entitiesGlobalSecondaryIndex, something like: entitiesGlobalSecondaryIndex contains entityCode1.
Would this be efficient, or does using a global secondary index not make sense in this way, so that DynamoDB will simply check the condition against every document, which is similar to a scan?
Any help is very appreciated,
Thanks
The contains operator of a query cannot be run on a partition key. In order for a query to use any sort of operator (contains, begins with, >, <, etc.) you must have a range attribute - aka your sort key.
You can very well set up a GSI with some value as your PK and this code as your SK. However, GSIs are replicas of the table - there is a slight potential for the data in a GSI to lag behind that of the master copy. If the query you're doing against this GSI isn't run very often, then you're probably safe from that.
However, if you are trying to do this against the entire table at once, then it's no better than a scan.
If what you need is a specific code to return all of its documents at once, then you could do a GSI with that as the PK. If you add a date field as the SK of this GSI, it would even be time sorted. If you query against that code in that index, you'll get every single one of them.
Since you may have multiple codes, if there aren't too many per document, you could perhaps use a Sparse Index - if you have an entity with code "AAAA" then you also add an attribute named AAAA (or AAAAflag or something). It is always null / does not exist unless the entities list contains that code. If you create a GSI on this AAAAflag attribute, it will only contain documents that contain that entity code, and ignore all documents where this attribute does not exist. This may work for you if you can also provide a good PK on this to keep the numbers well partitioned and if you don't have too many codes.
Filter expressions, by the way, are different from all of the above. Filter expressions are run on the data that would be returned, after it is already read out of the table. This is useful if you have a multi-access-pattern setup but don't want a particular call to get all the documents associated with a particular PK - in the interest of keeping the data your code is working with concise. A query with a filter expression still retrieves everything from that query, but only presents what makes it past the filter.
If you are only querying against a particular PK at any given time and you want to know if it contains any entities of x, then a filter expression would work perfectly. Of course, this is only per PK and not for your entire table.
If all you need is numbers, then you could do a count attribute on the document, or a meta document on that partition that contains these values and could be queried directly.
Lastly, and I have no idea if this would work or not: if your entities attribute is a map type, you might very well be able to filter against the entity code - and maybe even with entities.code.contains(value) if it were an SK - but I do not know if this is possible or not.
I have a query that uses WHERE id IN (1,2,3,...) where the list (1,2,3,...) is dynamically generated from an array of integers (not using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids.
I couldn't find anything that looks related in https://sqlite.org/limits.html
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
AND flows.id IN (1,2,3) /* <=== here more than 26623 IDs make it super slow */
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
Before I try to make that reproducible in isolation (e.g. search_params is a custom virtual table), does anyone know what limitation I might be running into? It's not the number of IDs per se, since a different query runs just fine with the same IDs.
SQLite version 3.36.0 via better-sqlite3 (Node.js) with a readonly database. The only pragma I use is journal_mode = WAL.
Compiled with (https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/compilation.md#bundled-configuration):
SQLITE_DQS=0
SQLITE_LIKE_DOESNT_MATCH_BLOBS
SQLITE_THREADSAFE=2
SQLITE_USE_URI=0
SQLITE_DEFAULT_MEMSTATUS=0
SQLITE_OMIT_DEPRECATED
SQLITE_OMIT_GET_TABLE
SQLITE_OMIT_TCL_VARIABLE
SQLITE_OMIT_PROGRESS_CALLBACK
SQLITE_OMIT_SHARED_CACHE
SQLITE_TRACE_SIZE_LIMIT=32
SQLITE_DEFAULT_CACHE_SIZE=-16000
SQLITE_DEFAULT_FOREIGN_KEYS=1
SQLITE_DEFAULT_WAL_SYNCHRONOUS=1
SQLITE_ENABLE_MATH_FUNCTIONS
SQLITE_ENABLE_DESERIALIZE
SQLITE_ENABLE_COLUMN_METADATA
SQLITE_ENABLE_UPDATE_DELETE_LIMIT
SQLITE_ENABLE_STAT4
SQLITE_ENABLE_FTS3_PARENTHESIS
SQLITE_ENABLE_FTS3
SQLITE_ENABLE_FTS4
SQLITE_ENABLE_FTS5
SQLITE_ENABLE_JSON1
SQLITE_ENABLE_RTREE
SQLITE_ENABLE_GEOPOLY
SQLITE_INTROSPECTION_PRAGMAS
SQLITE_SOUNDEX
HAVE_STDINT_H=1
HAVE_INT8_T=1
HAVE_INT16_T=1
HAVE_INT32_T=1
HAVE_UINT8_T=1
HAVE_UINT16_T=1
HAVE_UINT32_T=1
Here's the answer from the SQLite forums. Essentially this is a combination of how the query planner handles IN literals and of the cost my virtual table estimates. That means I was running into the exact point at which the query planner makes a different decision.
SQLite NGQP is a cost based query planner. The IN () operator with a list of literal values gets implemented as a kind of temporary table; sometimes SQLite decides to create an index and do lookups, other times it decides to use that table as the outermost loop of the query.
EXPLAIN QUERY PLAN should show that in a more concise manner.
If compiled in DEBUG mode with WHERETRACE enabled, the .wheretrace command will show how SQLite NGQP reaches its plan. Essential input is the return values from the xBestIndex method of your virtual table, especially the "number of rows" and the "estimated cost". It is paramount to deliver accurate estimates. Cost should reflect processing cost relative to SQLite native tables.
Note that you can name the IN table by making it a CTE and CROSS JOIN to force the query plan that works fast.
https://sqlite.org/forum/forumpost/a3d68ed8b40cf583?t=h
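For illustration, a rewrite along those lines might look roughly like this (a sketch only; whether putting the ID list in the outer loop is actually the fast order depends on your data and on the cost estimates your virtual table reports via xBestIndex):
WITH ids(id) AS (VALUES (1), (2), (3)) -- the long literal list goes here
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM ids CROSS JOIN flows -- CROSS JOIN pins the join order left to right
JOIN view_requests AS req ON flows.request_id = req.id, search_params(search) AS params
WHERE flows.id = ids.id
AND search NOT IN ('', '?')
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC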
The workaround I use is json_each, serializing the array of integers into a JSON string. In my particular use case this has some other benefits as well (e.g. I can bind a single parameter and re-use the query with any number of IDs), so I don't mind doing that:
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
-AND flows.id IN (1,2,3)
+AND flows.id IN (SELECT value FROM json_each('[1,2,3]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
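Because the whole list is now a single JSON string, the statement can also be prepared once and the list bound as one parameter (serializing the array to something like '[1,2,3]' happens in application code):
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
AND flows.id IN (SELECT value FROM json_each(?)) -- bind the JSON array string here
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC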
I also know that the generic virtual table implementation of better-sqlite3 makes a trade-off between being easy to use (it's ridiculously easy) and achieving maximum performance.
On a 3-node Couchbase Community Edition 5.0.1 build 5003 cluster, Couchbase indicates that it contains 12268503 items. However, when counting the ids, the result is 6132875.
What are the factors that can make the item count differ from the item id count in couchbase?
More precisely, when the following N1QL query is executed on a bucket - say Product:
SELECT count(1) FROM Product
It gives
12268503
While when the count is made on the item ids
SELECT count(META(Product).id) FROM Product
It returns:
6132875
That is, the number of ids is less than 50% of the number of items.
Also, there was no operation (0 ops/s) on the bucket for several hours, which excludes the possibility of the primary index not having caught up due to a traffic peak.
I pored through the Couchbase blog & docs without finding any clues as to this count difference. Any pointer is much appreciated.
If the query has no predicate and no join, and the projection is a single expression such as count(*) or count(constant), the query gets the result from the bucket stats and provides the info (this takes sub-milliseconds).
SELECT count(*) FROM Product;
SELECT count(1) FROM Product;
The following is almost the same, but the COUNT argument is an expression, so it has to use an index and do aggregation. (Since in this case the argument is the document key, which is unique and must be a string, the optimizer could have treated it like the previous approach; as of now there is no such optimization.)
SELECT count(META(Product).id) FROM Product
In the second case the query uses an index, and your index might have pending items and not be caught up. Try using scan_consistency. Also check the index stats.
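For example, the second count can be re-run with a stricter consistency level so the query waits for the indexer to process all pending mutations first (scan_consistency is a request-level option set through cbq, the REST API, or an SDK, not part of the N1QL statement itself):
-- issue with the request option scan_consistency = 'request_plus'
SELECT count(META(Product).id) FROM Product;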
I am faced with a database (sqlite specifically) query that I am not sure how to approach.
I'm looking for all tuples that have 1-n word matches between their 'name' attribute and a constant, sorted in descending order of the number of matching words.
For example, it is a database containing food items. If the constant is "Maranatha Natural Almond Butter 26oz Lightly Roasted", I would like any tuple in the database that contains at least one of the words in that constant to be returned. For example, "Almond Butter Natural" would come before "Maranatha Natural", which would come before "Almond", etc.
Essentially, as long as there is one intersecting word between the tuple's attribute and the constant, it qualifies as a match.
Matching words is what SQLite's full-text search extension is designed for. Please read that page to see how SQLite must be compiled and what is possible, but I'll add some remarks:
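For example, assuming a plain table foods(name) holding the item names (both names here are just for illustration), an FTS table can be created and populated roughly like this; the fts4 module requires FTS3/FTS4 to be enabled in your build:
CREATE VIRTUAL TABLE foods_fts_tab USING fts4(name);
INSERT INTO foods_fts_tab(name) SELECT name FROM foods;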
Simple matching is done with a query like:
SELECT * FROM foods_fts_tab WHERE name MATCH 'Maranatha Natural Almond etc.'
This will just return all records where at least one word matches.
You can weight the matches with information returned by the auxiliary functions.
For example,
SELECT ... ORDER BY length(offsets(foods_fts_tab)) DESC
will sort by the number of matching words.
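Putting the two together, a ranked query for the example constant could look like this (a sketch; offsets() is an FTS3/FTS4 auxiliary function, so it is not available with FTS5):
SELECT name
FROM foods_fts_tab
WHERE name MATCH 'Maranatha Natural Almond Butter 26oz Lightly Roasted'
ORDER BY length(offsets(foods_fts_tab)) DESC;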
You are asking for the number of word matches, but real search engines also use other information to compute relevancy scores. See the matchinfo() function in section 4.3 and the example in appendix A.
What are they and how do they work?
Where are they used?
When should I (not) use them?
I've heard the term over and over again, yet I don't know its exact meaning.
What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?
(Notice: This is not my homework; I go to school, but they teach us only the BASICs in informatics.)
Wikipedia seems to have a pretty nice answer to what they are.
You should use them when you want to look up values by some index.
As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)
You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).
The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with this, and each requires storing the key with the value: one (open addressing) is to use a predetermined probe pattern to look onward from the hashed location in the array for a slot that is free, the other (chaining) is to store a linked list hanging off each entry in the array (so you do a linear lookup over what is hopefully a short list). The cases of production code where I've read the source have all used chaining, with dynamic rebuilding of the hash table when the load factor becomes excessive.
Good hash functions are one-way functions that allow you to create a distributed value from any given input. Therefore, you will get reasonably unique values for each input value. They are also repeatable, such that any given input will always generate the same output.
An example of a good hash function is SHA1 or SHA256.
Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.
While any of these columns could have duplicates, let's assume that no rows are exactly the same.
In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.
We could look up the user record like this from our database:
SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';
We have to search through 4 different columns, using 4 different indexes, to find my record.
However, you could create a new "hash" column, and store the hash value of all four columns combined.
String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");
You might get a hash value like AE32ABC31234CAD984EA8.
You store this hash value as a column in the database and index on that. You now only have to search one index.
SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';
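Creating that column and its index might look like this (a sketch; the column and index names are just for illustration):
ALTER TABLE users ADD COLUMN hash_value VARCHAR(64);
CREATE INDEX idx_users_hash_value ON users (hash_value);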
Once we have the id for the requested user, we can use that value to look up related data in other tables.
The idea is that the hash function offloads work from the database server.
Collisions are not likely. If two users have the same hash, it's most likely that they have duplicate data.