How to determine position of specific character/string in SQLite string column value? - sqlite

I have values in a SQLite table* that contain a number of strings, of different lengths, joined by periods, something like this:
SomeApp.SomeNameSpace.InterestingString.NotInteresting
SomeApp.OtherNameSpace.WantThisOne.ReallyQuiteDull
SomeApp.OtherNameSpace.WantThisOne.AlsoDull
SomeApp.DifferentNameSpace.AlwaysWorthALook.LittleValue
I'd like to extract (in this case) the third period-delimited substring so I could write something like
SELECT interesting_string, COUNT(*)
FROM (SELECT third_part_of_period_delimited_string(name) AS interesting_string
      FROM my_table)
GROUP BY interesting_string;
Obviously I can do this any number of ways programmatically; I'm wondering if there's any way to achieve this in a SQLite SELECT query?
* It's a SharpDevelop Profiler database, if anyone's curious

No.
You can, as you mention, work with the strings after you have selected them from the database. Or you can split them up into separate columns when they are stored.
If you do not have access to the code that is storing the data, you might want to consider reading the data in its entirety, splitting the strings and storing the split out tokens in separate columns in a new table. If the data is not too large, you might look at storing this table in a new memory database to give excellent performance.
Whether this is worthwhile depends on whether one pass to split the data strings can be made use of many times. If the data is constantly changing, then this scheme would probably not work well.
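If one pass is worthwhile, a minimal sketch of that split-table approach could look like the following; the table and column names are made up, and application code would do the actual splitting on the periods:
CREATE TABLE name_parts (
    original_name TEXT,
    part1 TEXT,
    part2 TEXT,
    part3 TEXT,   -- the "interesting" token in the examples above
    part4 TEXT
);
-- Once populated, the query from the question becomes straightforward:
SELECT part3 AS interesting_string, COUNT(*)
FROM name_parts
GROUP BY part3;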

Related

Read all data from a table of 10k rows in a single request

Could there be problems in reading all the data of a 10k rows table in a single request?
It would be a read only request.
I would like to do it because I want to perform some queries on the array, and from the documentation I can’t find a way to do it directly with Pact.
No, there shouldn't be. Read-only queries are "free" at the moment.
You can do it in two ways:
1) Do a select query whose filter always evaluates to true.
2) Get all the keys (i.e. the unique ids in the table) via (keys your-table-name) and then have a separate method which returns the data for a list of ids.
But do consider using select statements to filter your data during the query, as this could be easier than filtering it yourself afterwards.
Pact will check arrays like any other property, but you should ask yourself the question - do you need to test all 10k records or just a representative sample of them (the answer should in most cases be the latter).
You should also consider:
Do you need an exact match? (If so, the consumer and provider must have exactly the same data - not recommended.)
Can you use matchers to check the shape of the items in the array?

Full-Text search in Sql server with multiple tables and ranking

We have a website running on DNN 7.1 with SQL Server. We implemented full-text search to show search results, and we need to search several tables and show the results to the user. The current implementation is: the user enters search word(s) and clicks search, the code-behind creates several threads to search the different tables, and the results are merged. We are using the CONTAINS predicate; the issue with this is that there is no ranking, and sometimes after the merge the results on the first page are not the best matches. I thought I could use CONTAINSTABLE and order the results by rank, but I read that the rank value has no meaning by itself; it merely indicates which rows match best within the current result set. In my scenario I have multiple result sets, so how will I know which are the best matches across all of them? Or am I going about this the wrong way? What is a good way to handle this scenario? We need to improve the response time along with getting better results. Any help is greatly appreciated.
This is how we implemented full-text searching across multiple tables (a sketch of the schema and query follows these steps):
1) Create a new table that stores the primary keys of the other tables (one column per source table), another column for the string-concatenated values of all the search fields from each table, and another column for the checksum of the concatenated values.
2) Implement the full-text index on this new table, and create a job that regularly synchronizes/updates the concatenated search values only when the BINARY_CHECKSUM value differs.
3) Use the CONTAINS predicate on this new table and, based on the results, join back to the corresponding source tables via the primary keys returned.
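A hedged sketch of this scheme; every table, column, and catalog name here is illustrative, and CONTAINSTABLE is used instead of CONTAINS so that a single RANK is available across all source tables, as the question asks for:
CREATE TABLE dbo.SearchIndex (
    SearchId       INT IDENTITY(1,1) CONSTRAINT PK_SearchIndex PRIMARY KEY,
    ArticleId      INT NULL,               -- PK of dbo.Articles when the row came from there
    ProductId      INT NULL,               -- PK of dbo.Products when the row came from there
    SearchText     NVARCHAR(MAX) NOT NULL, -- concatenated searchable fields from the source row
    SearchChecksum INT NOT NULL            -- BINARY_CHECKSUM(SearchText), compared by the sync job
);

-- Full-text index on the concatenated text (assumes a full-text catalog named SearchCatalog exists):
CREATE FULLTEXT INDEX ON dbo.SearchIndex (SearchText)
    KEY INDEX PK_SearchIndex
    ON SearchCatalog;

-- One ranked query over everything, then join back to the source tables by their keys:
SELECT TOP (50) ft.[RANK], a.Title, p.Name
FROM CONTAINSTABLE(dbo.SearchIndex, SearchText, 'widgets') AS ft
JOIN dbo.SearchIndex AS si ON si.SearchId = ft.[KEY]
LEFT JOIN dbo.Articles AS a ON a.ArticleId = si.ArticleId
LEFT JOIN dbo.Products AS p ON p.ProductId = si.ProductId
ORDER BY ft.[RANK] DESC;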

When to include an index (automated heuristic)

I have a piece of software which takes in a database and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals..>) AND ...). This works nicely.
I have a simple script that is passed a (often large) number of files, each describing a row
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names and an INSERT query for each. It then loads the variable names, sort | uniq's them, and makes a CREATE TABLE statement out of them (SQLite, amusingly enough, is OK with having all columns be NUMERIC, even if they actually end up containing text data). Once this is done, it executes the INSERTs (in a single transaction; otherwise it would take ages).
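For concreteness, the generated SQL looks roughly like this (the table name runs is made up; the columns come from the example file above):
CREATE TABLE runs (name NUMERIC, x NUMERIC, y NUMERIC);  -- SQLite happily stores text in NUMERIC columns
BEGIN TRANSACTION;
INSERT INTO runs (name, x, y) VALUES ('foo', 12, 23.4);
-- ... one INSERT per input file ...
COMMIT;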
To improve performance, I added a basic index on each column. However, this increases the database size fairly significantly and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be sorted upon.
The second type should have an index, because it will commonly be filtered by.
The third type probably shouldn't need an index, because it will be used in output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filtration queries of the form WHERE foo IN (5) AND bar IN (12,14,15)? Note that I don't know which columns the user will pick, beyond the fact that they will be type 2 columns.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O, so tables with less than ten rows should not be indexed because all the data fits into a single page anyway, so having an index would just force SQLite to read another page for the index.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
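A hedged sketch of these suggestions, assuming the made-up runs table also has the type 2 columns foo and bar from the example filter:
-- A single two-column index tailored to that particular filter:
CREATE INDEX idx_runs_foo_bar ON runs (foo, bar);
-- Or, when the filtered columns are not known in advance, one index per selective type 2 column,
-- letting the query planner pick the best one:
CREATE INDEX idx_runs_foo ON runs (foo);
CREATE INDEX idx_runs_bar ON runs (bar);
ANALYZE;  -- gather statistics so the planner can choose well
-- Check which index is actually used:
EXPLAIN QUERY PLAN
SELECT AVG(x) AS x, AVG(y) AS y
FROM runs
WHERE foo IN (5) AND bar IN (12, 14, 15);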

Do SQLite queries that return large result sets take more time?

When performing a SQLite query, does the size of the returned result set affect how long the query takes? Let's assume for this question that I don't actually access any of the data in the result; I just want to know whether the query itself takes longer. Let's also assume that I am simply selecting all rows and have no WHERE or ORDER BY clauses.
For example, say I have two tables A and B, where table A has a million rows and table B has 10 rows, and both tables have the same number and types of columns. Will selecting all rows in table A take longer than selecting all rows in table B?
This is a follow-up to my question How does a cursor refer to deleted rows?. I am guessing that if SQLite makes a copy of the data during the query, then queries that return large result sets may take longer, unless there is an optimization that only copies the query result data if the data in the db changes while the query is still alive?
Depending on some details, yes, a query may take different amounts of time.
Example: I have a table with some 20k entries. I do a GLOB search that must examine every row, with a LIMIT. If the LIMIT is met, the query can stop early; if not, it must go through the entire table (or JOIN). So searches with many matches return more quickly than searches with only a few.
If the query has to run through the same amount of data, I wouldn't expect a significant difference between a smaller and a larger number of selected rows, although returning more rows will of course still cost some I/O.
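A rough sketch of that kind of query (table and column names are made up):
-- The GLOB filter with a leading wildcard cannot use an ordinary index, so rows are examined
-- one by one until 50 matches are found, or the whole table has been scanned if there are fewer.
SELECT *
FROM entries
WHERE body GLOB '*needle*'
LIMIT 50;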

What would happen if a DataSet was returned duplicate named columns from SQL?

I am creating a stored procedure that brings back a bunch of data that I need from multiple tables; however, the tables share some duplicate column names. It works fine in SQL, but I am wondering what will happen, and how I will differentiate between the columns, once I am accessing them as DataRows from a DataSet. Anyone know?
It should automatically rename them by appending a number, for example COLUMN_NAME, COLUMN_NAME1, and COLUMN_NAME2. But this is difficult at best to maintain, and could cause trouble later.
To avoid this, you'll probably want to specify the names yourself using column aliases (the AS keyword):
SELECT t1.myColumn AS t1_col, t2.myColumn AS t2_col
FROM t1, t2
