SQLite query optimisation

The query
SELECT * FROM Table WHERE Path LIKE 'geo-Africa-Egypt-%'
can be optimized as:
SELECT * FROM Table WHERE Path >= 'geo-Africa-Egypt-' AND Path < 'geo-Africa-Egypt-zzz'
But how can this be done:
select * from foodDb where Food LIKE '%apples%';
How can this be optimized?

One option is redundant data. If you often query for a fixed set of strings occurring in the middle of some column, add another column that records whether a particular string can be found in the other column.
Another option, for arbitrary but still tokenizable strings, is to create a dictionary table that holds the tokens (e.g. apples) and foreign key references to the rows of the actual table where each token occurs.
In general, SQLite is by design not very good at full-text searches.
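The dictionary-table idea can be sketched roughly as follows, using Python's built-in sqlite3 module. The table food_token, the index food_token_idx, the id column on foodDb and the helper functions are all hypothetical names made up for illustration, and the tokenizer is a plain whitespace split, which is an assumption rather than anything the answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foodDb (id INTEGER PRIMARY KEY, Food TEXT);
    -- Hypothetical dictionary table: one row per (token, food row) pair.
    CREATE TABLE food_token (
        token   TEXT NOT NULL,
        food_id INTEGER NOT NULL REFERENCES foodDb(id)
    );
    CREATE INDEX food_token_idx ON food_token(token, food_id);
""")

def add_food(text):
    """Insert a row and record each lower-cased, whitespace-separated token."""
    cur = conn.execute("INSERT INTO foodDb (Food) VALUES (?)", (text,))
    food_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO food_token (token, food_id) VALUES (?, ?)",
        [(tok, food_id) for tok in set(text.lower().split())],
    )

def find_by_token(token):
    """Exact-token lookup: an equality test that food_token_idx can serve."""
    rows = conn.execute(
        "SELECT f.Food FROM food_token t JOIN foodDb f ON f.id = t.food_id "
        "WHERE t.token = ? ORDER BY f.id",
        (token.lower(),),
    )
    return [r[0] for r in rows]

for text in ["green apples", "apples and pears", "ripe bananas"]:
    add_food(text)
```

Note that this matches whole tokens only: a row containing "pineapples" would not be found by searching for "apples", so it is not a drop-in replacement for LIKE '%apples%'.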

It would surprise me if it were faster, but you could try GLOB instead of LIKE and compare (note that GLOB is case-sensitive, unlike LIKE with its default settings):
SELECT * FROM foodDb WHERE Food GLOB '*apples*';
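One way to compare the variants without timing them is to ask SQLite for its query plan. The sketch below uses Python's sqlite3 module; the columns id and Price, the index name food_idx and the helper plan() are assumptions added for illustration. It checks a prefix range of the kind shown in the question against the two leading-wildcard patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foodDb (id INTEGER PRIMARY KEY, Food TEXT, Price REAL);
    CREATE INDEX food_idx ON foodDb(Food);  -- hypothetical index name
""")

def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Prefix search written as a range: the index can be searched.
prefix_plan = plan(
    "SELECT * FROM foodDb WHERE Food >= 'apples' AND Food < 'applet'"
)
# Leading wildcard: the pattern gives the index no starting point.
like_plan = plan("SELECT * FROM foodDb WHERE Food LIKE '%apples%'")
glob_plan = plan("SELECT * FROM foodDb WHERE Food GLOB '*apples*'")

for name, detail in [("range", prefix_plan), ("LIKE", like_plan), ("GLOB", glob_plan)]:
    print(name, "->", detail)
```

The exact wording of the plan text differs between SQLite versions, but a SEARCH line means the index is used and a SCAN line means every row is visited.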

Related

Do not fail on missing column in a SQLite query

I have a simple query like this:
SELECT * FROM CUSTOMERS WHERE CUSTID LIKE '~' AND BANKNO LIKE '~'
The problem is, the customers-table might or might not contain the BANKNO column depending on circumstances I've no control over. If however BANKNO is not a column in CUSTOMERS, this query fails.
So my question is: is it possible to test whether the BANKNO column exists and, if so, include it in the query, and if not, exclude it?
The query really has to be flexible.
A query that references a non-existent column will always fail in sqlite3.
One option might be to put the "full" SQL in a try block and, if it errors, execute the other SQL.
Or, you could query PRAGMA table_info('CUSTOMERS') and inspect the result to see whether the column in question is in the table. The SQLite documentation is at https://www.sqlite.org/pragma.html#pragma_table_info.
I'm sure there are other options, but the bottom line is that you need to know, before the SQL is executed, that it contains only valid column names.
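A minimal sketch of the PRAGMA approach, assuming the query is issued from Python's sqlite3 module (the question does not say which host language is used). The helpers column_exists and find_customers, and the two demo databases, are hypothetical.

```python
import sqlite3

def column_exists(conn, table, column):
    """Look the column up in PRAGMA table_info (row[1] is the column name)."""
    # Identifiers cannot be bound as parameters, so `table` must come from
    # trusted code, never from user input.
    rows = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
    return any(row[1].lower() == column.lower() for row in rows)

def find_customers(conn, custid_pattern, bankno_pattern):
    """Build the WHERE clause from the columns that actually exist."""
    sql = "SELECT * FROM CUSTOMERS WHERE CUSTID LIKE ?"
    params = [custid_pattern]
    if column_exists(conn, "CUSTOMERS", "BANKNO"):
        sql += " AND BANKNO LIKE ?"
        params.append(bankno_pattern)
    return conn.execute(sql, params).fetchall()

# Demo data: one database with BANKNO, one without.
with_bank = sqlite3.connect(":memory:")
with_bank.execute("CREATE TABLE CUSTOMERS (CUSTID TEXT, BANKNO TEXT)")
with_bank.execute("INSERT INTO CUSTOMERS VALUES ('c1', 'b1'), ('c2', 'x9')")

without_bank = sqlite3.connect(":memory:")
without_bank.execute("CREATE TABLE CUSTOMERS (CUSTID TEXT)")
without_bank.execute("INSERT INTO CUSTOMERS VALUES ('c1'), ('c2')")
```

Checking the schema first keeps the error path for genuine failures, whereas the try-block variant would also swallow unrelated SQL errors unless the exception message is inspected.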

How to use dynamic values while executing SQL scripts in R

My R workflow now involves dealing with a lot of queries (RPostgreSQL library). I really want to make code easy to maintain and manage in the future.
I started loading large queries from separate .SQL files (this helped) and it worked great.
Then I started using interpolated values (that helped) which means that I can write
SELECT * FROM table WHERE value = ?my_value;
and (after loading it into R) interpolate it using sqlInterpolate(ANSI(), query, value = "stackoverflow").
What happens now is I want to use something like this
SELECT count(*) FROM ?my_table;
but how can I make it work? sqlInterpolate() only interpolates safely by default. Is there a workaround?
Thanks
In ?DBI::SQL, you can read:
By default, any user supplied input to a query should be escaped using
either dbQuoteIdentifier() or dbQuoteString() depending on whether it
refers to a table or variable name, or is a literal string.
Also, on this page:
You may also need dbQuoteIdentifier() if you are creating tables or
relying on user input to choose which column to filter on.
So you can use:
sqlInterpolate(ANSI(),
"SELECT count(*) FROM ?my_table",
my_table = dbQuoteIdentifier(ANSI(), "table_name"))
# <SQL> SELECT count(*) FROM "table_name"
sqlInterpolate() is for substituting values only, not other components like table names. You could use other templating frameworks such as brew or whisker.

UNION of tables using bigquery LegacySQL

I'm trying, without luck, to write a query that retrieves the union of two event tables using legacySQL, as standardSQL is not yet supported in Data Studio.
In standardSQL that would be something like:
SELECT
*
FROM
`com_myapp_ANDROID.app_events_*`,
`com_myapp_IOS.app_events_*`
However, in legacySQL I get an error when trying to refer to app_events_*. If I can't use the wildcard, how do I include all of my event tables so that I can filter them afterwards in Data Studio?
I've tried something like:
select * from (TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events_"'))
But not sure if this is the right approach, I get:
Cannot output multiple independently repeated fields at the same time.
Found user_dim_user_properties_value_index and event_dim_date
Edit: in the end this is the resulting query, as you can't use FLATTEN directly with TABLE_QUERY:
select
*
from
FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"')),user_dim.user_properties),
FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_IOS, 'table_id CONTAINS "app_events"')),user_dim.user_properties)
Table wildcards don't work in legacy SQL as you have guessed so you have to use the TABLE_QUERY() function.
Your approach is right but the first parameter in the TABLE_QUERY function should be the dataset name not the first part of the table name. Assuming your dataset name is app_events that would look like this:
TABLE_QUERY(app_events,'table_id CONTAINS "app_events"')
In legacySQL, the comma is the table union operator:
select * from [table1],[table2]
For TABLE_QUERY you would include the dataset name as first param, and the expression for the second
select * from (TABLE_QUERY([dataset], 'table_id CONTAINS "event"'))
For more on how to debug TABLE_QUERY, see the linked answer.
The web UI automatically flattens the results for you, but when there are independently repeated fields you need to flatten them yourself with the FLATTEN wrapper.
It takes two parameters, the table and the repeated field, e.g. FLATTEN(table, tags).
Also, if TABLE_QUERY is involved, you probably need a subselect, like:
select
*
from
FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"')),user_dim.user_properties)
The particular issue you are experiencing is not related to the union: you will see the same error message even with a single table if that table has multiple independently repeated fields and you try to output them all at once. This scenario is specific to legacy SQL and can be resolved with FLATTEN.
At the same time, you most likely don't actually need SELECT *, which is what puts all those repeated fields in the output at once. If you can narrow down the output list, you may avoid the problem; if several independently repeated fields still remain in the output, use the FLATTEN technique.

Fast search on a blob starting bytes in SQLite

Is there a way to index blob fields and have the index used for beginning of blob searches?
Currently I have hashes stored as hexadecimal in text fields.
These hashes in hexadecimal form are 32 characters long, and form the bulk of the data in the database.
Problem is, they are often searched by their starting bytes, as in
select * from mytable where hash like '00a1b2%'
I would like to store them as blobs, as this saves about 30% of the database size. However while
select * from mytable where hex(hash) like '00a1b2%'
works, it's also much slower and does not seem to use the index.
Searching for exact blob matches does use the index, so the index is working.
Is there a way to perform a search on a blob start (with binary/memcmp "collation") that would use the index?
I also tried substr(), it's apparently faster than hex() but still not indexed
select * from mytable where substr(hash, 1, 6) = x'00a1b2'
To be able to use an index for LIKE, the table column must have TEXT affinity, and the index must be case insensitive:
CREATE TABLE mytable(... hash TEXT, ...);
CREATE INDEX hash_index ON mytable(hash COLLATE NOCASE);
Functions like hex or substr prevent usage of indexes.
Blobs can be indexed and compared like other types.
This allows you to express a prefix search with two comparisons:
SELECT * FROM mytable WHERE hash >= x'00a1b2' AND hash < x'00a1b3'
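The upper bound in such a range has to be computed from the prefix. The following is a sketch in Python's sqlite3 module, under the assumption that blobs compare bytewise (memcmp order, shorter prefix first), as they do in SQLite. The helpers prefix_bounds and find_by_prefix, the index name hash_idx and the sample rows are hypothetical.

```python
import sqlite3

def prefix_bounds(prefix: bytes):
    """Return (low, high) such that low <= blob < high matches the prefix.

    high is None when no upper bound exists (empty or all-0xFF prefix).
    """
    trimmed = prefix.rstrip(b"\xff")  # trailing 0xFF bytes cannot be incremented
    if not trimmed:
        return prefix, None
    high = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, high

def find_by_prefix(conn, prefix: bytes):
    """Prefix search expressed as plain comparisons so the index applies."""
    low, high = prefix_bounds(prefix)
    if high is None:
        sql, params = "SELECT hash FROM mytable WHERE hash >= ? ORDER BY hash", (low,)
    else:
        sql = "SELECT hash FROM mytable WHERE hash >= ? AND hash < ? ORDER BY hash"
        params = (low, high)
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (hash BLOB)")
conn.execute("CREATE INDEX hash_idx ON mytable(hash)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [(bytes.fromhex(h),) for h in ["00a1b1ff", "00a1b2", "00a1b2ff", "00a1b300"]],
)
```

Python bytes values are bound as blobs, so no hex() or substr() call is needed on the column side of the comparison.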

SQLite data retrieve with select taking too long

I have created a table with SQLite for my Corona/Lua app. It's a hash table with about 700,000 values. The table has two columns: the hashcode (a string) and the value (another string). During the program I need to get data several times by providing the hashcode.
I'm using something like this code to get the data:
for p in db:nrows([[SELECT * FROM test WHERE id=']].."hashcode"..[[';]]) do
print(p)
-- p = returned value --
end
However, this statement takes far too long to run.
thanks,
Edit:
Success!
The mistake was with the primary key. I set the hashcode as the primary key, as shown below, and the retrieval time went back to normal:
CREATE TABLE IF NOT EXISTS test (id STRING PRIMARY KEY , array);
I also prepared the statements in advance as you said:
stmt = db:prepare("SELECT * FROM test WHERE id = ?;")
[...]
stmt:bind(1,s)
for p in stmt:nrows() do
The only problem was that the db file size, which was around 18 MB, went up to 29.5 MB.
You should create the table with id as a unique primary key; this will automatically make an index.
create table if not exists test
(
id text primary key,
val text
);
You should not construct statements using string concatenation; this is a security issue so avoid getting in this habit. Also, you should prepare statements in advance, at program initialization, and run the prepared statements.
Something like this... initially:
hashcode_query_stmt = db:prepare("SELECT * FROM test WHERE id = ?;")
then for each use:
hashcode_query_stmt:bind_values(hashcode)
for p in hashcode_query_stmt:urows() do ... end
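For comparison, here is the same pattern sketched in Python's sqlite3 module rather than Lua. This is only an illustration under assumptions: Python has no explicit prepare call (the module keeps an internal statement cache instead), and the sample data, LOOKUP_SQL and lookup() are hypothetical. The query plan is read back to confirm that the primary key's automatic index serves the lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS test (id TEXT PRIMARY KEY, val TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(f"hash{i}", f"value{i}") for i in range(1000)],
)

# One fixed SQL string with a placeholder; the value is bound, never concatenated.
LOOKUP_SQL = "SELECT val FROM test WHERE id = ?"

def lookup(hashcode):
    """Return the value stored for hashcode, or None if there is no such row."""
    row = conn.execute(LOOKUP_SQL, (hashcode,)).fetchone()
    return row[0] if row else None

# The 'detail' column of the plan names the index SQLite intends to use.
plan_detail = conn.execute(
    "EXPLAIN QUERY PLAN " + LOOKUP_SQL, ("hash1",)
).fetchone()[3]
print(plan_detail)
```

Because the hashcode is bound as a parameter, input such as x' OR '1'='1 is treated as an ordinary string and simply matches no row.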
Ensure that there is an index on the id/hashcode column. Without one, such queries will be very slow. This index should probably be unique.
If you only select the value for a hashcode (SELECT value FROM ..), it may be beneficial to have a covering index over (id, value), as that can avoid the additional seek to the row data (see SQLite Query Planning). Try it with and without such a covering index.
Also, it may be worthwhile to employ caching if the same hashcodes are queried multiple times.
As already stated, make sure you have an index on id.
If you can't change the table schema now, you can add an index ad hoc:
CREATE INDEX test_id ON test (id);
About hashes: if you are computing hashes in your software to speed up searches, don't!
SQLite treats the hashes you supply like any other string or blob. Also, RDBMSs are optimized for efficient searching, which can be improved greatly with indexes.
Unless you're hashing to save space, you are wasting processor time computing hashes in your application.

Resources