Is it possible to ignore certain columns that are almost definitely going to be different in a tSQLt AssertEqualsTable? Examples would be primary keys from the two results tables, insert/update date stamps, and so on.
I have been working around this by selecting only the relevant columns into new temp tables and comparing those instead, but this means extra work and extra places to make mistakes. Not a lot, sure, but it adds up over dozens or hundreds of tests.
A built-in or simple way to say 'compare these two tables but ignore columns X and Y' would be very useful. Is there a better solution than the one I'm using?
All you need to do is populate an #expected table with the columns you are interested in. When AssertEqualsTable does the comparison it will ignore any columns in the #actual table that don't exist in the #expected table.
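For example (a minimal sketch; the table dbo.CustomerTotals and the column names are invented for illustration), a test might look like this:

-- Only the columns we care about go into #expected
CREATE TABLE #expected (CustomerName NVARCHAR(100), Amount DECIMAL(10,2));
INSERT INTO #expected VALUES ('Alice', 10.00), ('Bob', 20.00);

-- #actual still carries Id and UpdatedDate, but because those columns are not
-- in #expected, AssertEqualsTable will not compare them
SELECT CustomerName, Amount, Id, UpdatedDate
INTO #actual
FROM dbo.CustomerTotals;

EXEC tSQLt.AssertEqualsTable '#expected', '#actual';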
I have a MariaDB table which contains a lot of metadata and is very big in terms of bytes.
I have columns A and B in that table, along with other columns.
I would like to join that table with another table (stuff) in order to get column C from it.
So I have something like:
SELECT metadata.A, metadata.B, stuff.C
FROM metadata
JOIN stuff ON metadata.D = stuff.D;
This query sometimes takes a very long time. I suspect it's because (AFAIK, please correct me if I'm wrong) the JOIN stores the result of the join in some side table, and since the metadata table is very big it has to copy a lot of data even though I don't use it, so I thought about optimizing it with WITH as follows:
WITH m as (SELECT A,B,D FROM metadata),
s as (SELECT C,D FROM stuff)
SELECT * FROM m JOIN s ON m.D = s.D;
The execution plan is the same (using EXPLAIN), but I think it will be faster since the side tables that will be created by WITH (again, AFAIK WITH also creates side tables, please correct me if I'm wrong) will be smaller and only contain the needed data.
Is my logic correct? Is there some way I can test that in MariaDB?
More likely, there is some form of cache speeding up one query or the other.
The Query cache is usually recognizable by a query time that is only about 1ms. It can be turned off via SELECT SQL_NO_CACHE ... to get a timing to compare against.
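For example, applied to the query from the question:

SELECT SQL_NO_CACHE metadata.A, metadata.B, stuff.C
FROM metadata
JOIN stuff ON metadata.D = stuff.D;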
The other likely cache is the buffer_pool. Data is read from disk into the buffer_pool unless it is already there. The simple workaround for strange timings is to run the query twice and take the second 'time'.
Your hypothesis that WITH creates 'small' temp tables falls apart because the work needed to read the original tables is the same with or without WITH.
Please provide SHOW CREATE TABLE for the two tables. There are a couple of datatype issues that may be involved -- big TEXTs or BLOBs.
The newly-added WITH opens up the possibility of recursive CTEs (and other things). And it provides a way to materialize a temp table that is used more than once. Neither of those applies in your query, so I would not expect any performance improvement.
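As an illustration of the one case mentioned above where a CTE can pay off, here is a hedged sketch in which the CTE is referenced twice, so MariaDB may materialize it once instead of computing it twice. The filter and the join on m.B are invented purely for illustration:

WITH filtered AS (
    SELECT D, C FROM stuff WHERE C > 100   -- invented filter, illustration only
)
SELECT m.A, f1.C AS c_by_d, f2.C AS c_by_b
FROM metadata m
JOIN filtered f1 ON f1.D = m.D
LEFT JOIN filtered f2 ON f2.D = m.B;       -- second reference to the CTE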
Problem
Can't create a table with an index column that references multiple rows in a table. Picture example below of what I'm trying to create.
Overview
Imagine an (SQLite) table that will hold stock dividend payments. The index column is set to the ticker symbols. However, each ticker symbol refers to multiple records, which are organized by a timestamp. The documentation on SQLite and about 15 other tutorials all seem to focus on indexing where there is always a 1:1 relationship between an index and a record. I would like to create an index with a 1:many relationship.
The lookup would find the appropriate stock by symbol, and then (probably) use a secondary index on the dates in the first column. But I cannot find any examples where others have tried to set up this structure, which makes me think I don't have the right approach, or that this is just a special case.
I don't think your problem is actually a problem. Putting an index on a column doesn't mean it has to contain unique values. It's perfectly reasonable for values in an indexed column to repeat. Of course there are diminishing returns. E.g. If you have a million rows and only five different values in a column, an index on that column isn't really going to do much for you.
A good rule of thumb is to start with an index on the column(s) you're using in your where clause. Then run the queries and see if you're getting satisfactory performance.
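For example (the schema below is invented for illustration; the real column names may differ), an ordinary non-unique index on the symbol column, or a composite index on symbol plus date, is all that is needed:

-- Illustrative schema
CREATE TABLE dividends (
    pay_date TEXT,
    symbol   TEXT,
    amount   REAL
);

-- A non-unique index: many rows may share the same symbol
CREATE INDEX idx_dividends_symbol ON dividends (symbol);

-- Or a composite index matching the typical lookup (symbol, then date)
CREATE INDEX idx_dividends_symbol_date ON dividends (symbol, pay_date);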
I have a piece of software which takes in a database and uses it to produce graphs based on what the user wants (primarily queries of the form SELECT AVG(<input1>) AS x, AVG(<input2>) AS y FROM <input3> WHERE <key> IN (<vals...>) AND ...). This works nicely.
I have a simple script that is passed an (often large) number of files, each describing a row:
name=foo
x=12
y=23.4
....... etc.......
The script goes through each file, saving the variable names, and an INSERT query for each. It then loads the variable names, sort | uniq's them, and makes a CREATE TABLE statement out of them (sqlite, amusingly enough, is ok with having all columns be NUMERIC, even if they actually end up containing text data). Once this is done, it then executes the INSERTS (in a single transaction, otherwise it would take ages).
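For reference, the generated SQL presumably ends up with roughly this shape (the table name 'results' and the column list are placeholders based on the example file above):

CREATE TABLE results (name NUMERIC, x NUMERIC, y NUMERIC);
BEGIN TRANSACTION;
INSERT INTO results (name, x, y) VALUES ('foo', 12, 23.4);
-- ... one INSERT per input file ...
COMMIT;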
To improve performance, I added a basic index on each column. However, this increases the database size fairly significantly and only provides a moderate improvement.
Data comes in three basic types:
single value, indicating things like program version, etc.
a few values (<10), indicating things like input parameters used
many values (>1000), primarily output data.
The first type obviously shouldn't need an index, since it will never be sorted upon.
The second type should have an index, because it will commonly be filtered by.
The third type probably shouldn't need an index, because it will be used in output.
It would be annoying to determine which type a particular value is before it is put in the database, but it is possible.
My question is twofold:
Is there some hidden cost to extraneous indexes, beyond the size increase that I have seen?
Is there a better way to index for filtration queries of the form WHERE foo IN (5) AND bar IN (12,14,15)? Note that I don't know which columns the user will pick, beyond that it will be a type 2 column.
Read the relevant documentation:
Query Planning;
Query Optimizer Overview;
EXPLAIN QUERY PLAN.
The most important thing for optimizing queries is avoiding I/O. Tables with fewer than ten rows should not be indexed, because all the data fits into a single page anyway; an index would just force SQLite to read another page.
Indexes are important when you are looking up records in a big table.
Extraneous indexes make table updates slower, because each index needs to be updated as well.
SQLite can use at most one index per table in a query.
This particular query could be optimized best by having a single index on the two columns foo and bar.
However, creating such indexes for all possible combinations of lookup columns is most likely not worth the effort.
If the queries are generated dynamically, the best idea probably is to create one index for each column that has good selectivity, and rely on SQLite to pick the best one.
And don't forget to run ANALYZE.
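As a sketch of the suggestions above (the table name t is a placeholder; the column names come from the question):

-- A composite index tailored to WHERE foo IN (5) AND bar IN (12, 14, 15)
CREATE INDEX idx_t_foo_bar ON t (foo, bar);

-- Or, for dynamically generated queries, one single-column index per selective column
CREATE INDEX idx_t_foo ON t (foo);
CREATE INDEX idx_t_bar ON t (bar);

-- Refresh the statistics the query planner relies on
ANALYZE;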
I have values in a SQLite table* that contain a number of strings, of different lengths, joined by periods, something like this:
SomeApp.SomeNameSpace.InterestingString.NotInteresting
SomeApp.OtherNameSpace.WantThisOne.ReallyQuiteDull
SomeApp.OtherNameSpace.WantThisOne.AlsoDull
SomeApp.DifferentNameSpace.AlwaysWorthALook.LittleValue
I'd like to extract (in this case) the third period-delimited substring so I could write something like
SELECT interesting_string, COUNT(*)
FROM ( SELECT third_part_of_period_delimited_string(name) AS interesting_string
       FROM my_table )
GROUP BY interesting_string;
Obviously I can do this any number of ways programmatically; I'm wondering if there's any way to achieve this in a SQLite SELECT query?
* It's a SharpDevelop Profiler database, if anyone's curious
No.
You can, as you mention, work with the strings after you have selected them from the database. Or you can split them up into separate columns when they are stored.
If you do not have access to the code that is storing the data, you might want to consider reading the data in its entirety, splitting the strings and storing the split out tokens in separate columns in a new table. If the data is not too large, you might look at storing this table in a new memory database to give excellent performance.
Whether this is worthwhile depends on whether one pass to split the data strings can then be reused many times. If the data is constantly changing, this scheme would probably not work well.
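If you do go the split-table route, a minimal sketch of what it might look like (all names invented; the actual splitting would be done in whatever code loads the data):

CREATE TABLE split_names (
    part1 TEXT,   -- e.g. 'SomeApp'
    part2 TEXT,   -- e.g. 'OtherNameSpace'
    part3 TEXT,   -- e.g. 'WantThisOne'
    part4 TEXT    -- e.g. 'ReallyQuiteDull'
);

-- Once populated, the query from the question becomes straightforward:
SELECT part3 AS interesting_string, COUNT(*)
FROM split_names
GROUP BY part3;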
I am creating a stored procedure that brings back a bunch of data that I need from multiple tables; however, the tables share some duplicate column names. It works fine in SQL, but I am wondering what will happen, and how I will differentiate between the columns, once I am accessing them as DataRows from a DataSet. Anyone know?
It should automatically rename them by appending a number. For example, COLUMN_NAME, COLUMN_NAME1, and COLUMN_NAME2. But, this is difficult at best to maintain, and could cause trouble later.
To avoid this, you'll probably want to specify the names yourself using column aliases (the AS keyword):
SELECT t1.myColumn AS t1_col, t2.myColumn AS t2_col
FROM t1, t2