Here is an interesting one: it is only happening with one database file, not with any others that I have. I cured the problem, but thought it was quite interesting.
I have a table -
<partial table>
CREATE TABLE [horsestats] (
[horseID] INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
[name] VARCHAR(30) NULL,
[flatrating] INTEGER DEFAULT '0' NULL
);
All of the zero values in the table were set by the default value; any other value has been set by my software. About 70% of the records have a value set. So, if we run this -
SELECT horsestats.flatrating FROM horsestats WHERE horsestats.flatrating<>0 ORDER BY horsestats.flatrating DESC LIMIT 20;
We don't really expect this -
flatrating
0
0
0
etc
In fact, only values that are 0 are listed; none of the non-zero values are in the output. If we reverse it, we might expect only the non-zero values to be listed:
SELECT horsestats.flatrating FROM horsestats WHERE horsestats.flatrating=0 ORDER BY horsestats.flatrating DESC LIMIT 20;
But no, there are no records returned.
So what does this one get us (this is the place I started, because it is the first set that my software needs):
SELECT horsestats.flatrating FROM horsestats ORDER BY horsestats.flatrating DESC;
I bet your socks you guess wrong. It gets this:
flatrating
0
0
0
0
130
128
127
126
125
124
124
As I said, this doesn't happen on any other database or table that I have. I'm going to fix it now by explicitly setting all values of zero to zero; I suspect this will put it right.
Actually, it didn't. If I run:
UPDATE horsestats SET flatrating='0' WHERE flatrating='0';
The problem remains, so it looks like I have to write that database file off as corrupt. In this case that's OK, because I have to load the majority of the data from elsewhere in a pre-load for the software anyway.
So the question is: why?
Could SQLite be doing a strange mix of ANSI and numeric sort? It's the only thing I can think of that would give that sort order. Also, the value zero in this table does not seem to be numerically zero, though it behaves as expected once it is passed to my software.
I think your problem is that you're quoting the zero - this is making it a string. Make a table like this:
CREATE TABLE [horsestats] (
[horseID] INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
[name] VARCHAR(30) NULL,
[flatrating] INTEGER DEFAULT 0 NULL
);
and it seems to work. Alternatively, run an unquoted version of your update command:
UPDATE horsestats SET flatrating=0 WHERE flatrating='0';
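If you want to see what is actually stored, SQLite's typeof() function reports each value's storage class; rows that received the quoted default will report 'text' rather than 'integer'. A quick check:
SELECT typeof(flatrating), count(*)
FROM horsestats
GROUP BY typeof(flatrating);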
Entries in my table are uniquely identified by a word that is 5-10 characters long, and I use TINYTEXT(10) for the column. However, when I try to set it as the PRIMARY key I get an error that the size is missing.
From my limited understanding of the docs, the size for PRIMARY keys can be used as a shortcut for detecting a unique value, i.e. when the first few characters (specified by the size) are enough to consider it a unique match. In my case the size would differ from 5 to 10 (they are all latin1, so it is exactly one byte per character, plus 1 for the length). Two questions:
If I wanted to use TINYTEXT as the PRIMARY key, which size should I specify? The maximum available - 10 in this case? Or must the size be strictly EXACT? For example, if my key is a 6-character word but I specify a size of 10 for the PK, will it try to read all 10 characters, fail, and throw an exception?
How bad, performance-wise, would it be to use [TINY]TEXT for the PK? All the Google results lead me to opinions and statements like "it is BAD, you are fired", but is it really true in this case, considering TINYTEXT is 255 max and I have already limited the length to 10?
MySQL/MariaDB can index only the first characters of a text field, not the whole text if it is too large. The maximum key size is 3072 bytes, and any text field larger than that cannot be used as a KEY. Therefore, on text fields longer than 3072 bytes you must specify explicitly how many characters to index. With VARCHAR or CHAR this happens directly, because you set the length explicitly when declaring the datatype. That is not the case with *TEXT - those types do not have that option. The solution is to create the primary key like this:
CREATE TABLE mytbl (
name TEXT NOT NULL,
PRIMARY KEY idx_name(name(255))
);
The same trick can be used if you need to make a primary key on a VARCHAR field longer than 3072 bytes, or on BINARY fields and BLOBs. Bear in mind, though, that if two large, different texts share the same first 3072 bytes, they will be treated as equal by the system. That may be a problem.
It is generally a bad idea to use a text field as a primary key. There are two reasons for that:
1. It takes much more processing time than using integers to search the table (WHERE, JOINs, etc.).
2. Any foreign key in another table must have the same datatype as the primary key. When you use text, this wastes disk space.
Note: the difference between *TEXT and VARCHAR is that the contents of *TEXT fields are not stored inside the table row but in an external storage location. Usually we do that when we need to store really large text.
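Given the question's constraints (latin1 words of at most 10 characters), a hedged sketch of the usual alternative - an integer surrogate primary key plus a unique index on the word (the table and index names here are illustrative):
CREATE TABLE mytbl (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(10) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY idx_name (name)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Foreign keys elsewhere then reference the 4-byte id instead of copying the text.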
You cannot specify a size for TINYTEXT. Use VARCHAR(size) instead.
See: SQL Data Types.
FYI, you can't specify a size for TINYTEXT in MySQL:
mysql> create table t1 ( t tinytext(10) );
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds
to your MySQL server version for the right syntax to use near '(10) )' at line 1
You can specify a length after TEXT, but it doesn't work the way you think it does. It means it will choose one of the family of TEXT types, the smallest type that supports at least the length you requested. But once it does that, it does not limit the length of input. It still accepts any data up to the maximum length of the type it chose.
mysql> create table t1 ( t text(10) );
Query OK, 0 rows affected (0.02 sec)
mysql> show create table t1\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`t` tinytext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
mysql> insert into t1 set t = repeat('a', 255);
Query OK, 1 row affected (0.01 sec)
mysql> select length(t) from t1;
+-----------+
| length(t) |
+-----------+
| 255 |
+-----------+
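For contrast, VARCHAR(10) does enforce its limit; under a strict sql_mode the oversized insert is rejected rather than silently truncated (a sketch, not verbatim server output):
mysql> create table t2 ( t varchar(10) );
mysql> insert into t2 set t = repeat('a', 255);
ERROR 1406 (22001): Data too long for column 't' at row 1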
I'm running the following:
.mode tabs
CREATE TABLE mytable(mytextkey TEXT PRIMARY KEY, field1 INTEGER, field2 REAL);
.import mytable.tsv mytable
mytable.tsv is approx. 6 GB and 50 million rows. The process takes an extremely long time (hours) to run, and it also completely throttles the performance of the entire system, I'm guessing because of temporary disk I/O.
I don't understand why it takes so long and why it thrashes the disk so much when I have plenty of free physical RAM it could use for temporary writes.
How do I improve this process?
PS: Yes, I did search for a previous question and answer, but nothing I found helped.
In SQLite, a normal rowid table uses a 64-bit integer primary key. If the PK in the table definition is anything but a single INTEGER column, it is instead treated as a unique index, and each row inserted has to update both the original table and that index, doubling the work (and, in your case, effectively doubling the storage requirements). If you instead make the table a WITHOUT ROWID one, the PK is a true PK and doesn't require an extra index table. That change alone should roughly halve both the time it takes to import your dataset and the size of the database. (If you have other indexes on the table, or use that PK as a foreign key in another table, the change might not be worth it in the long run, as it will increase the space needed for those tables, potentially by a lot given the lengths of your keys; in that case, see Schwern's answer.)
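For the table in the question, that change is a one-line addition:
CREATE TABLE mytable(
  mytextkey TEXT PRIMARY KEY,
  field1 INTEGER,
  field2 REAL
) WITHOUT ROWID;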
Sorting the input on the key column first can also help on large imports, because there is less random access of b-tree pages and less moving of data within those pages. Everything goes into the same page until it fills up, a new one is allocated, and any needed rebalancing is done.
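If sorting the 6 GB file externally is inconvenient, a hedged alternative is to stage the raw rows in an unindexed table first and then insert them into the real table in key order (the staging table name here is illustrative):
CREATE TABLE staging(mytextkey TEXT, field1 INTEGER, field2 REAL);
-- .import mytable.tsv staging
INSERT INTO mytable SELECT * FROM staging ORDER BY mytextkey;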
You can also turn on some unsafe settings that aren't recommended in normal usage because they can result in data loss or outright corruption; but if that happens during an import because of a freak power outage or whatever, you can always just start over. In particular, set the synchronous mode and the journal type to OFF. That results in fewer disc writes over the course of the import.
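Concretely, those two settings are set per connection, before running the .import:
PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;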
My assumption is the problem is the text primary key. This requires building a large and expensive text index.
The primary key is a long nucleotide sequence (anywhere from 20 to 300 characters), field1 is an integer (between 1 and 1500), and field2 is a relative log ratio (between -10 and +10, roughly).
Text primary keys have few advantages and many drawbacks.
They require large, slow indexes. Slow to build, slow to query, slow to insert.
Text is tempting to change, which is exactly what you don't want a primary key to do.
Any table referencing it must also store and index the text, adding to the bloat.
Joins with this table will be slower due to the text primary key.
Consider what happens when you make a new table which references this one.
create table othertable(
    myreference references mytable, -- this is text
    something integer,
    otherthing integer
);
othertable must now store a copy of the entire sequence, bloating the table. Instead of a simple integer it now has a text column, bloating the table. And it must build its own text index, bloating the index and slowing down joins and inserts.
Instead, use a normal, integer, autoincrementing primary key and make the sequence column unique (which is also indexed). This provides all the benefits of a text primary key with none of the drawbacks.
create table sequences(
id integer primary key autoincrement,
sequence text not null unique,
field1 integer not null,
field2 real not null
);
Now references to sequences are a simple integer.
Because the SQLite import process is not very customizable, getting your data into this table in SQLite efficiently requires a couple steps.
First, import your data into a table which does not yet exist. Make sure it has header fields matching your desired column names.
$ cat test.tsv
sequence field1 field2
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
sqlite> .import test.tsv import_sequences
As there's no indexing happening, this process should go pretty quickly. SQLite has made a table called import_sequences with every column of type text.
sqlite> .schema import_sequences
CREATE TABLE import_sequences(
"sequence" TEXT,
"field1" TEXT,
"field2" TEXT
);
sqlite> select * from import_sequences;
sequence field1 field2
---------- ---------- ----------
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
Now we create the final production table.
sqlite> create table sequences(
...> id integer primary key autoincrement,
...> sequence text not null unique,
...> field1 integer not null,
...> field2 real not null
...> );
For efficiency, normally you'd add the unique constraint after the import, but SQLite has very limited ability to alter a table and cannot alter an existing column except to change its name.
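That said, a separate unique index can be created after the bulk transfer, which is functionally equivalent to the inline constraint and keeps the big INSERT itself free of index maintenance on the sequence column (a sketch; the index name is illustrative):
create table sequences(
    id integer primary key autoincrement,
    sequence text not null,
    field1 integer not null,
    field2 real not null
);
-- run the INSERT ... SELECT transfer below, then:
CREATE UNIQUE INDEX sequences_sequence_idx ON sequences(sequence);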
Now transfer the data from the import table into sequences. The primary key will be automatically populated.
insert into sequences (sequence, field1, field2)
select sequence, field1, field2
from import_sequences;
Because the sequence must be indexed, this might not import any faster, but it will result in a much better and more efficient schema going forward. If you want efficiency, consider a more robust database.
Once you've confirmed the data came over correctly, drop the import table.
The following settings helped speed things up tremendously.
PRAGMA journal_mode = OFF
PRAGMA cache_size = 7500000
PRAGMA synchronous = 0
PRAGMA temp_store = 2
I want to create unique order numbers for each day. So ideally, in PostgreSQL for instance, I could create a sequence and read it back for these unique numbers, because the readback both gets me the new number and is atomic. Then at close of day, I'd reset the sequence.
In sqlite3, however, I only see an autoincrement for the integer field type. So say I set up a table with an autoincrement field and insert a record to get the new number (which seems like an awfully inefficient way to do it, but anyway...). When I go to read the max back, who is to say that another task hasn't gone in there and inserted ANOTHER record, causing me to read back a miss, with my number one too far advanced (and a duplicate of what the other task reads back)?
Conceptually, I require:
fast lock with wait for other tasks
increment number
retrieve number
unlock
...I just don't see how to do that with sqlite3. Can anyone enlighten me?
In SQLite, autoincrementing fields are intended to be used as actual primary keys for their records.
You should just use it as the ID for your orders table.
If you really want to have an atomic counter independent of corresponding table records, use a table with a single record.
ACID is ensured with transactions:
BEGIN;
SELECT number FROM MyTable;
UPDATE MyTable SET number = ? + 1;
COMMIT;
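(The ? in the UPDATE is bound to the number just read.) Two hedged refinements: use BEGIN IMMEDIATE so the write lock is taken up front and two concurrent readers cannot see the same value; and on SQLite 3.35+ the read-and-increment collapses into one atomic statement:
UPDATE MyTable SET number = number + 1 RETURNING number;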
OK, it looks like SQLite either doesn't have what I need, or I am missing it. Here's what I came up with:
declare zorder as INTEGER PRIMARY KEY AUTOINCREMENT and zuid INTEGER in the orders table
this means every new row gets an ascending number, starting with 1
generate a random number:
rnd = int(random.random() * 1000000) # unseeded python uses system time
create new order (just the SQL for simplicity):
'INSERT INTO orders (zuid) VALUES ('+str(rnd)+')'
find that exact order number using the random number:
'SELECT zorder FROM orders WHERE zuid = '+str(rnd)
pack away that number as the new order number (newordernum)
clobber the random number to reduce collision risks
'UPDATE orders SET zuid = 0 WHERE zorder = '+str(newordernum)
...and now I have a unique new order, I know the correct order number, the risk of a read collision is reduced to negligible, and I can prepare that order without worrying that I'm trampling on another newly created order.
Just goes to show you why DB authors implement sequences, lol.
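For the record, on newer SQLite (3.24+, which added UPSERT) the conceptual lock/increment/retrieve/unlock sequence from the question maps onto a short transaction, and keying the counter by date gives the close-of-day reset for free (a sketch; the table and column names are illustrative):
CREATE TABLE daily_counter(
    day TEXT PRIMARY KEY,   -- e.g. '2013-01-17'
    n INTEGER NOT NULL
);
BEGIN IMMEDIATE;            -- fast lock with wait for other tasks
INSERT INTO daily_counter(day, n) VALUES (date('now'), 1)
    ON CONFLICT(day) DO UPDATE SET n = n + 1;         -- increment number
SELECT n FROM daily_counter WHERE day = date('now');  -- retrieve number
COMMIT;                     -- unlock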
Background
I'm implementing full-text search over a body of email messages stored in SQLite, making use of its fantastic built-in FTS4 engine. I'm getting some rather poor query performance, although not exactly where I would expect. Let's take a look.
Representative schema
I'll give some simplified examples of the code in question, with links to the full code where applicable.
We've got a MessageTable that stores the data about an email message (full version spread out over several files here, here, and here):
CREATE TABLE MessageTable (
id INTEGER PRIMARY KEY,
internaldate_time_t INTEGER
);
CREATE INDEX MessageTableInternalDateTimeTIndex
ON MessageTable(internaldate_time_t);
The searchable text is added to an FTS4 table named MessageSearchTable (full version here):
CREATE VIRTUAL TABLE MessageSearchTable USING fts4(
id INTEGER PRIMARY KEY,
body
);
The id in the search table acts as a foreign key to the message table.
I'll leave it as an exercise for the reader to insert data into these tables (I certainly can't give out my private email). I have just under 26k records in each table.
Problem query
When we retrieve search results, we need them to be ordered descending by internaldate_time_t so we can pluck out only the most recent few results. Here's an example search query (full version here):
SELECT id
FROM MessageSearchTable
JOIN MessageTable USING (id)
WHERE MessageSearchTable MATCH 'a'
ORDER BY internaldate_time_t DESC
LIMIT 10 OFFSET 0
On my machine, with my email, that runs in about 150 milliseconds, as measured via:
time sqlite3 test.db <<<"..." > /dev/null
150 milliseconds is no beast of a query, but for a simple FTS lookup and indexed order, it's sluggish. If I omit the ORDER BY, it completes in 10 milliseconds, for example. Also keep in mind that the actual query has one more sub-select, so there's a little more work going on in general: the full version of the query runs in about 600 milliseconds, which is into beast territory, and omitting the ORDER BY in that case shaves 500 milliseconds off the time.
If I turn on stats inside sqlite3 and run the query, I notice the line:
Sort Operations: 1
If my interpretation of the docs about those stats is correct, it looks like the query is completely skipping using the MessageTableInternalDateTimeTIndex. The full version of the query also has the line:
Fullscan Steps: 25824
Sounds like it's walking the table somewhere, but let's ignore that for now.
What I've discovered
So let's work on optimizing that a little bit. I can rearrange the query into a sub-select and force SQLite to use our index with the INDEXED BY extension:
SELECT id
FROM MessageTable
INDEXED BY MessageTableInternalDateTimeTIndex
WHERE id IN (
SELECT id
FROM MessageSearchTable
WHERE MessageSearchTable MATCH 'a'
)
ORDER BY internaldate_time_t DESC
LIMIT 10 OFFSET 0
Lo and behold, the running time has dropped to around 100 milliseconds (300 milliseconds in the full version of the query, a 50% reduction in running time), and there are no sort operations reported. Note that with just reorganizing the query like this but not forcing the index with INDEXED BY, there's still a sort operation (though we've still shaved off a few milliseconds oddly enough), so it appears that SQLite is indeed ignoring our index unless we force it.
I've also tried some other things to see if they'd make a difference, but they didn't:
Explicitly making the index DESC as described here, with and without INDEXED BY
Explicitly adding the id column in the index, with and without internaldate_time_t ordered DESC, with and without INDEXED BY
Probably several other things I can't remember at this moment
Questions
100 milliseconds here still seems awfully slow for what seems like it should be a simple FTS lookup and indexed order.
What's going on here? Why is it ignoring the obvious index unless you force its hand?
Am I hitting some limitation with combining data from virtual and regular tables?
Why is it still so relatively slow, and is there anything else I can do to get FTS matches ordered by a field in another table?
Thanks!
An index is useful for looking up a table row based on the value of the indexed column.
Once a table row has been found, indexes are no longer useful, because it is not efficient to look up a table row in an index by any other criterion.
An implication of this is that it is not possible to use more than one index for each table accessed in a query.
Also see the documentation: Query Planning, Query Optimizer.
Your first query has the following EXPLAIN QUERY PLAN output:
0 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
0 1 1 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY
What happens is that
1. the FTS index is used to find all matching MessageSearchTable rows;
2. for each row found in step 1, the MessageTable primary key index is used to find the matching row;
3. all rows found in step 2 are sorted with a temporary table;
4. the first 10 rows are returned.
Your second query has the following EXPLAIN QUERY PLAN output:
0 0 0 SCAN TABLE MessageTable USING COVERING INDEX MessageTableInternalDateTimeTIndex (~100000 rows)
0 0 0 EXECUTE LIST SUBQUERY 1
1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
What happens is that
1. the FTS index is used to find all matching MessageSearchTable rows;
2. SQLite goes through all entries in the MessageTableInternalDateTimeTIndex in index order, and returns a row when its id value is one of the values found in step 1;
3. SQLite stops after the tenth such row.
In this query, it is possible to use the index for (implied) sorting, but only because no other index is used for looking up rows in this table.
Using an index in this way implies that SQLite has to go through all entries, instead of looking up only the few rows that match some other condition.
When you omit the INDEXED BY clause from your second query, you get the following EXPLAIN QUERY PLAN output:
0 0 0 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~25 rows)
0 0 0 EXECUTE LIST SUBQUERY 1
1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY
which is essentially the same as your first query, except that joins and subqueries are handled slightly differently.
With your table structure, it is not really possible to get faster.
You are doing three operations:
1. looking up rows in MessageSearchTable;
2. looking up corresponding rows in MessageTable;
3. sorting rows by a MessageTable value.
As far as indexes are concerned, steps 2 and 3 conflict with each other.
The database has to choose whether to use an index for step 2 (in which case sorting must be done explicitly) or for step 3 (in which case it has to go through all MessageTable entries).
You could try to return fewer records from the FTS search by making the message time part of the FTS table and searching only the last few days (increasing the window, or dropping the time condition, if you don't get enough results).
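A hedged sketch of that idea: FTS4 columns are untyped, so the time needs a fixed-width text encoding for string comparison to agree with time order (the column name and encoding here are illustrative):
CREATE VIRTUAL TABLE MessageSearchTable USING fts4(
    body,
    internaldate        -- zero-padded epoch seconds, e.g. '0001358430000'
);
SELECT docid
FROM MessageSearchTable
WHERE MessageSearchTable MATCH 'a'
  AND internaldate > '0001358000000'
ORDER BY internaldate DESC
LIMIT 10;
Note that the extra column's tokens also land in the full-text index; FTS4's notindexed= option can exclude the column from tokenization if that matters.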
I have this table:
CREATE TABLE IF NOT EXISTS `branch` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`studcount` int(11) DEFAULT NULL,
`username` varchar(64) NOT NULL,
`branch_fk` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FKADAF25A2A445F1AF` (`branch_fk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=14 ;
ALTER TABLE `branch`
ADD CONSTRAINT `FKADAF25A2A445F1AF` FOREIGN KEY (`branch_fk`) REFERENCES `branch` (`id`);
as you can see, each row has a foreign key that points to another branch row (a self-relation).
I want a query using HQL (HQL preferred) that takes my username (or id) and returns a List<String> (for usernames) or List<Integer> (for ids) containing all of my sub-branches.
Let me show an example:
id studcount username branch_fk
1 312 user01 NULL
2 111 user02 1
3 432 user03 1
4 543 user04 2
5 433 user05 3
6 312 user06 5
7 312 user06 2
8 312 user06 7
When I call GetSubBranch(3) I want it to return:
5, 6
and when I call GetSubBranch(2) I want it to return:
4, 7, 8
I believe there is no portable SQL to do this.
Even more, I think several major databases' SQL cannot express this.
Therefore, this capability is not part of what you can do in HQL. Sorry :-(
I have read about a few ways to go. Most of them involve tradeoffs depending on the number of levels (fixed in advance? how many?), the number of records (hundreds? millions?), etc.:
Do the recursive queries yourself, levelling down one level at a time (with an IN (ids) clause), until some level comes back empty (see the sketch after this list).
Do a query with a fixed number of left joins (your depth needs to be known in advance; or you may need to repeat the query to find the rest of the records if needed, as in point 1).
Have the denormalized information available somewhere: it could be a denormalized table that copies the hierarchy. But I would prefer a cached in-memory copy, which can be filled completely in a single request and then updated or invalidated, depending on your other requisites (table size, max depth, write frequency, etc.).
You could also have a look at 'nested sets'. Querying becomes a matter of 'BETWEEN :L AND :R', but the topological/hierarchical sort is lost (in comparison to recursive/hierarchical queries), and inserting new items is quite costly as it requires updates on several if not all rows.
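A sketch of option 1 against the sample data above, in plain SQL (the HQL analogue has the same shape): issue one query per level, feeding each round's ids into the next IN list, until a round comes back empty.
-- GetSubBranch(3), level by level:
SELECT id FROM branch WHERE branch_fk IN (3);  -- returns 5
SELECT id FROM branch WHERE branch_fk IN (5);  -- returns 6
SELECT id FROM branch WHERE branch_fk IN (6);  -- returns nothing: stop
-- accumulated result: 5, 6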