Properly indexing a table by time on MariaDB

I believe this is only me not realizing something obvious.
I currently have a table of positions for a car tracking software.
The current structure is as follows:
CREATE TABLE `positions` (
`id` char(36) NOT NULL,
`vehicleId` char(36) DEFAULT NULL,
`time` datetime NOT NULL,
`date` date NOT NULL, -- date being time without the hours, minutes and seconds
`lat` decimal(10,7) NOT NULL,
`lng` decimal(10,7) NOT NULL,
`speed` int(11) NOT NULL,
`attributes` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`attributes`)),
`created_at` datetime(6) NOT NULL DEFAULT current_timestamp(6),
`updated_at` timestamp(6) NULL DEFAULT current_timestamp(6) ON UPDATE current_timestamp(6),
PRIMARY KEY (`id`),
KEY `IDX_0605352b480db5b3769797b9e8` (`time`),
KEY `IDX_de42da506f977dddd80bc8e3ac` (`vehicleId`,`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
This table holds positions from only one month, as I have a cron process that runs once a month and removes all positions that are not from the current month.
Yet the table has reached around a million entries, and queries on it have become extremely slow.
I am trying to fetch all positions from a specific date and from a specific vehicle:
SELECT * FROM positions WHERE vehicleId='id here' AND date='date here';
But for some reason it is extremely slow.
The server is a Xeon E5-1630 v4 with 4 GB RAM and a 160 GB SSD, running Fedora 34 (5.13.14-200.fc34.x86_64).
It runs MariaDB (10.5.12-MariaDB), Redis, Node.js and Caddy.
EDIT: Answering comments,
EXPLAIN SELECT * FROM positions WHERE vehicleId='5d634444-ed56-49b2-9628-ba51182391c1' AND date='2021-09-23';
+------+-------------+-----------+------+--------------------------------+--------------------------------+---------+-------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+------+--------------------------------+--------------------------------+---------+-------------+------+-----------------------+
| 1 | SIMPLE | positions | ref | IDX_de42da506f977dddd80bc8e3ac | IDX_de42da506f977dddd80bc8e3ac | 148 | const,const | 268 | Using index condition |
+------+-------------+-----------+------+--------------------------------+--------------------------------+---------+-------------+------+-----------------------+
innodb_buffer_pool_size is currently at 2 GB (half of my server's memory).

It looks like the 2G InnoDB buffer pool is already larger than the commonly used data, so before getting more RAM and increasing it, there are other options to investigate:
As vehicleId appears to be a UUID, utf8mb4 is rather wasteful for it. The column could be converted to ascii, latin1 or something else that uses 1 byte per character.
alter table positions modify vehicleID char(36) character set ascii DEFAULT NULL
Ensure you change the other vehicleId columns in other tables as well, otherwise joins that require a character set conversion get rather expensive (as a recent user discovered).
Note also that in the 10.7.0 preview, UUID is a new datatype.
restrict retrieval
If you aren't really using *, you can restrict the retrieval to just the fields needed. If the query can be reduced to just the indexed columns, no lookup of the other fields is needed. If attributes isn't needed, omitting it also avoids potential off-page lookups for that long text column.
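A sketch of both ideas, assuming the query only needs the position columns (the covering index name and its column list are illustrative, not part of the original schema):
ALTER TABLE positions
  ADD INDEX idx_vehicle_date_covering (vehicleId, `date`, `time`, lat, lng, speed);
SELECT `time`, lat, lng, speed
FROM positions
WHERE vehicleId = '5d634444-ed56-49b2-9628-ba51182391c1'
  AND `date` = '2021-09-23';
With such an index the query can be answered from the index alone, without touching the clustered rows at all.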
It also looks like (vehicleId, time) could be a composite primary key.
If this is the most common query and the current primary key isn't used elsewhere, clustering the table this way would speed up retrieval of the columns that aren't in the secondary index. It would involve changing the query to use time ranges to use the key most effectively.
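A rough sketch of that restructuring, under the assumptions that vehicleId can be made NOT NULL and that id is not referenced from any other table (id is kept in the key only to guarantee uniqueness when two fixes fall in the same second):
ALTER TABLE positions
  MODIFY vehicleId CHAR(36) CHARACTER SET ascii NOT NULL,
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (vehicleId, `time`, id);
SELECT `time`, lat, lng, speed
FROM positions
WHERE vehicleId = '5d634444-ed56-49b2-9628-ba51182391c1'
  AND `time` >= '2021-09-23'
  AND `time` < '2021-09-23' + INTERVAL 1 DAY;
With the rows clustered this way, a day's worth of one vehicle's positions sits contiguously in the table.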
Otherwise, look closer at RAM, especially ensure that MariaDB isn't swapping during query retrieval. Having buffer pool memory ending up in swap isn't useful.

Related

BLOB/TEXT column used in key specification without key length (1170)? [duplicate]

Entries in my table are uniquely identified by a word that is 5-10 characters long, and I use TINYTEXT(10) for the column. However, when I try to set it as the PRIMARY key I get the error that the size is missing.
From my limited understanding of the docs, the size for PRIMARY keys can be used as a shortcut for detecting a unique value, i.e. when the first few characters (specified by the size) are enough to consider it a unique match. In my case the size would vary from 5 to 10 (they are all latin1, so exactly one byte per character, plus 1 for the length). Two questions:
If I wanted to use TINYTEXT as the PRIMARY key, which size should I specify? The maximum available, 10 in this case? Or does the size have to be exact, e.g. if my key is a 6-character word but I specify a size of 10 for the PK, will it try to read all 10 characters, fail, and throw an exception?
How bad, performance-wise, would it be to use [TINY]TEXT for the PK? All Google results lead me to opinions and statements like "it is BAD, you are fired", but is it really true in this case, considering TINYTEXT is 255 max and I have already limited the length to 10?
MySQL/MariaDB can index only the first characters of a text field, not the whole text if it is too large. The maximum key size is 3072 bytes, and any text value larger than that cannot be used as a KEY. Therefore, on text fields longer than 3072 bytes you must specify explicitly how many characters to index. With VARCHAR or CHAR this can be done directly, because you set the length explicitly when declaring the datatype. That's not the case with *TEXT - those types don't have that option. The solution is to create the primary key like this:
CREATE TABLE mytbl (
name TEXT NOT NULL,
PRIMARY KEY idx_name(name(255))
);
The same trick can be used if you need to make a primary key on a VARCHAR field that is larger than 3072 bytes, on BINARY fields, and on BLOBs. Bear in mind, though, that if two large and different texts start with the same characters for the first 3072 bytes, they will be treated as equal by the system. That may be a problem.
It is generally a bad idea to use a text field as a primary key. There are two reasons for that:
2.1. It takes much more processing time than using integers to search the table (WHERE, JOINs, etc.). The link is old but still relevant;
2.2. Any foreign key in another table must have the same datatype as the primary key. When you use text, this will waste disk space;
Note: the difference between *TEXT and VARCHAR is that the contents of *TEXT fields are not stored inside the table row but in a separate storage location. Usually we do that when we need to store really large text.
For TINYTEXT you cannot specify a size. Use VARCHAR(size) instead.
SQL Data Types
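A minimal sketch of that suggestion, using the 10-character maximum from the question:
CREATE TABLE mytbl (
  name VARCHAR(10) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (name)
);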
FYI, you can't specify a size for TINYTEXT in MySQL:
mysql> create table t1 ( t tinytext(10) );
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds
to your MySQL server version for the right syntax to use near '(10) )' at line 1
You can specify a length after TEXT, but it doesn't work the way you think it does. It means it will choose one of the family of TEXT types, the smallest type that supports at least the length you requested. But once it does that, it does not limit the length of input. It still accepts any data up to the maximum length of the type it chose.
mysql> create table t1 ( t text(10) );
Query OK, 0 rows affected (0.02 sec)
mysql> show create table t1\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`t` tinytext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
mysql> insert into t1 set t = repeat('a', 255);
Query OK, 1 row affected (0.01 sec)
mysql> select length(t) from t1;
+-----------+
| length(t) |
+-----------+
| 255 |
+-----------+

How to fix "UNIQUE constraint failed" from VACUUM (also INTEGRITY_CHECK fails)

I use an app which creates this SQLite DB with this table:
CREATE TABLE expense_report (_id INTEGER PRIMARY KEY, ...)
And for some reason that _id (which is the ROWID) became invalid in that DB.
When I scan the table I see that the last rows got an _id which was already being used long ago:
1,2,3...,1137,1138,...,1147,1149,...,12263,12264,1138,...,1148
The ranges shown above are where I see the same _id used for completely different rows (the rest of the values do not match at all).
And querying this DB usually gets me inaccurate results due to that. For instance:
SELECT
(SELECT MAX(_ID) FROM expense_report) DirectMax
, (SELECT MAX(_ID) FROM (SELECT _ID FROM expense_report ORDER BY _ID DESC)) RealMax;
| DirectMax | RealMax |
| 1148 | 12264 |
And inserting a new row into this table via DB Browser for SQLite also generates an _id of 1149 (instead of 12265), so the problem becomes worse if I keep using this DB.
Running PRAGMA QUICK_CHECK or PRAGMA INTEGRITY_CHECK shows this error response:
*** in database main ***
On page 1598 at right child: Rowid 12268 out of order
And running VACUUM also detects the problem but doesn't seem to be able to fix it:
Execution finished with errors.
Result: UNIQUE constraint failed: expense_report._id
Does anyone know a way to fix these duplicate ROWID values?

SQLite is very slow when performing .import on a large table

I'm running the following:
.mode tabs
CREATE TABLE mytable(mytextkey TEXT PRIMARY KEY, field1 INTEGER, field2 REAL);
.import mytable.tsv mytable
mytable.tsv is approx. 6 GB and 50 million rows. The process takes an extremely long time (hours) to run and it also completely throttles the performance of the entire system, I'm guessing because of temporary disk IO.
I don't understand why it takes so long and why it thrashes the disk so much, when I have plenty of free physical RAM it could use for temporary write.
How do I improve this process?
PS: Yes, I did search for a previous question and answer, but nothing I found helped.
In Sqlite, a normal rowid table uses a 64-bit integer primary key. If you have a PK in the table definition that's anything but a single INTEGER column, that is instead treated as a unique index, and each row inserted has to update both the original table and that index, doubling the work (And in your case effectively doubling the storage requirements). If you instead make your table a WITHOUT ROWID one, the PK is a true PK and doesn't require an extra index table. That change alone should roughly halve both the time it takes to import your dataset and the size of the database. (If you have other indexes on the table, or use that PK as a foreign key in another table, it might not be worth making the change in the long run as it'll increase the amount of space needed for those tables by potentially a lot given the lengths of your keys; in that case, see Schwern's answer).
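A minimal sketch of that change, reusing the table definition from the question:
CREATE TABLE mytable(
  mytextkey TEXT PRIMARY KEY,
  field1 INTEGER,
  field2 REAL
) WITHOUT ROWID;
The key is then stored once, in the b-tree that also holds the row data, instead of once in the table and once more in an implicit unique index.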
Sorting the input on the key column first can help too on large imports because there's less random access of b-tree pages and moving of data within those pages. Everything goes into the same page until it fills up and a new one is allocated and any needed rebalancing is done.
You can also turn on some unsafe settings that aren't recommended in normal usage because they can result in data loss or outright corruption; but if that happens during an import because of a freak power outage or whatever, you can always just start over. In particular, set the synchronous mode and the journal type to OFF. That results in fewer disk writes over the course of the import.
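For example, run in the same session before the .import (these trade crash-safety for speed, so only use them for a bulk load you can redo from scratch):
PRAGMA journal_mode = OFF;
PRAGMA synchronous = OFF;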
My assumption is that the problem is the text primary key, which requires building a large and expensive text index.
The primary key is a long nucleotide sequence (anywhere from 20 to 300 characters), field1 is an integer (between 1 and 1500) and field2 is a relative log ratio (between -10 and +10 roughly).
Text primary keys have few advantages and many drawbacks.
They require large, slow indexes. Slow to build, slow to query, slow to insert.
Text is tempting to change, which is exactly what you don't want a primary key to do.
Any table referencing it also has to store and index the text, adding to the bloat.
Joins with this table will be slower due to the text primary key.
Consider what happens when you make a new table which references this one.
create table othertable(
  myreference references mytable, -- this is text
  something integer,
  otherthing integer
);
othertable now must store a copy of the entire sequence, bloating the table. Instead of being simple integers it now has a text column, bloating the table. And it must make its own text index, bloating the index, and slowing down joins and inserts.
Instead, use a normal, integer, autoincrementing primary key and make the sequence column unique (which is also indexed). This provides all the benefits of a text primary key with none of the drawbacks.
create table sequences(
id integer primary key autoincrement,
sequence text not null unique,
field1 integer not null,
field2 real not null
);
Now references to sequences are a simple integer.
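For example, a hypothetical child table (not part of the question's schema) then only carries the integer id:
create table observations(       -- hypothetical table, for illustration only
  id integer primary key autoincrement,
  sequence_id integer not null references sequences(id),
  value real not null
);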
Because the SQLite import process is not very customizable, getting your data into this table in SQLite efficiently requires a couple steps.
First, import your data into a table which does not yet exist. Make sure it has header fields matching your desired column names.
$ cat test.tsv
sequence field1 field2
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
sqlite> .import test.tsv import_sequences
As there's no indexing happening, this process should go pretty quickly. SQLite makes a table called import_sequences with every column typed as text.
sqlite> .schema import_sequences
CREATE TABLE import_sequences(
"sequence" TEXT,
"field1" TEXT,
"field2" TEXT
);
sqlite> select * from import_sequences;
sequence field1 field2
---------- ---------- ----------
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
Now we create the final production table.
sqlite> create table sequences(
...> id integer primary key autoincrement,
...> sequence text not null unique,
...> field1 integer not null,
...> field2 real not null
...> );
For efficiency, normally you'd add the unique constraint after the import, but SQLite has very limited ability to alter a table and cannot alter an existing column except to change its name.
Now transfer the data from the import table into sequences. The primary key will be automatically populated.
insert into sequences (sequence, field1, field2)
select sequence, field1, field2
from import_sequences;
Because the sequence column must be indexed, this step might not run any faster, but it will result in a much better and more efficient schema going forward. If you want more efficiency, consider a more robust database.
Once you've confirmed the data came over correctly, drop the import table.
The following settings helped speed things up tremendously.
PRAGMA journal_mode = OFF
PRAGMA cache_size = 7500000
PRAGMA synchronous = 0
PRAGMA temp_store = 2

Analyze a scenario performance?

I want to design something like a dynamic form in which the admin defines each form's fields.
I designed 3 tables: a mainform table for shared properties, then a formfield table which has mainformID as a foreign key and defines each form's fields,
e.g.:
AutoID | FormID | FieldName
_____________________________
100 | Form1 | weight
101 | Form1 | height
102 | Form1 | color
103 | Form2 | Size
104 | Form2 | Type
....
and at least a formvalues table like below:
FormFieldID | Value | UniqueResponseID
___________________________________________
100 | 50px | 200
101 | 60px | 200
102 | Red | 200
100 | 30px | 201
101 | 20px | 201
102 | Black | 201
103 | 20x10 | 201
104 | Y | 201
....
For each form I have to join these 3 tables to fetch all the fields and values. I wonder if this is the only way to design such a scenario? Does it decrease SQL performance? Or is there a faster and better way?
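A join of that shape might look roughly like this (table and column names are guesses based on the samples above):
SELECT ff.FieldName, fv.Value
FROM mainform mf
JOIN formfield ff ON ff.FormID = mf.FormID
JOIN formvalues fv ON fv.FormFieldID = ff.AutoID
WHERE mf.FormID = 'Form1'
  AND fv.UniqueResponseID = 200;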
This is a form of EAV, and I'm going to assume you absolutely have to do it instead of the "static" design.
does it decrease sql performance?
Yes, getting a bunch of rows (under EAV) is always going to be slower than getting just one (under the static design).
or is there any fast and better way?
Not from the logical standpoint, but there are significant optimizations (for query performance at least) that can be done at the physical level. Specifically, you can carefully design your keys to minimize the I/O (by putting related data close together) and even eliminate the JOIN itself.
For example:
This model migrates keys through the FOREIGN KEY hierarchy all the way down to the ATTRIBUTE_VALUE table. The resulting natural composite key in the ATTRIBUTE_VALUE table enables us to:
Get all attributes1 of a given form by a single index range scan + table heap access on ATTRIBUTE_VALUE table, and without doing any JOINs at all. In addition to that, you can cluster2 it, eliminating the table heap access and leaving you with only the index range scan3.
If you need to only get the data for a specific response, change the order of the fields in the composite key, so the RESPONSE_ID is at the leading edge.
If you need both "by form" and "by response" queries, you'll need both indexes, at which point I'd recommend that the secondary index also cover4 the VALUE field.
For example:
-- Since we haven't used NONCLUSTERED clause, this is a B-tree
-- that covers all fields. Table heap doesn't exist.
CREATE TABLE ATTRIBUTE_VALUE (
FORM_ID INT,
ATTRIBUTE_NAME VARCHAR(50),
RESPONSE_ID INT,
VALUE VARCHAR(50),
PRIMARY KEY (FORM_ID, ATTRIBUTE_NAME, RESPONSE_ID)
-- FOREIGN KEYs omitted for brevity.
);
-- We have included VALUE, so this B-tree covers all fields as well.
CREATE UNIQUE INDEX ATTRIBUTE_VALUE_IE1 ON
ATTRIBUTE_VALUE (RESPONSE_ID, FORM_ID, ATTRIBUTE_NAME)
INCLUDE (VALUE);
1 Or a specific attribute, or a specific response for a specific attribute.
2 MS SQL Server clusters all tables by default, unless you specify NONCLUSTERED clause.
3 Friendliness to clustering and elimination of JOINs are some of the main strengths of natural keys (as opposed to surrogate keys). But they also make tables "fatter" and don't isolate from ON UPDATE CASCADE. I believe pros outweigh cons in this particular case. For more info on natural vs. surrogate keys, look here.
4 Fortunately, MS SQL Server supports including fields in index solely for covering purposes (as opposed to actually searching through the index). This makes the index leaner than a "normal" index on the same fields.
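As a usage sketch, the two access paths described above then look like this; no JOIN is needed when the attribute/value pairs are all you want:
-- All attributes and values captured for one form (served by the clustered primary key)
SELECT ATTRIBUTE_NAME, RESPONSE_ID, VALUE
FROM ATTRIBUTE_VALUE
WHERE FORM_ID = 1;
-- Everything belonging to one response (served by the covering secondary index)
SELECT FORM_ID, ATTRIBUTE_NAME, VALUE
FROM ATTRIBUTE_VALUE
WHERE RESPONSE_ID = 200;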
I like Branko's approach, and it is quite similar to metadata models I have created in the past, so this post is by way of extension to his. You may want to add a datatype table, which can work both for native types (int, varchar, bit, datetime, etc.) and your own definitions (although I don't see the necessity off the cuff).
Branko's "value" column then becomes:
value_tinyint tinyint
value_int int
value_varchar varchar(xx)
etc.
with a datatype_id (probably tinyint) as a foreign key into the "mydatatype" table.
[excuse the lack of pretty ER diagrams like BD's]
mydatatype
datatype_id tinyint
code varchar(16)
description varchar(64) -- for reference purposes
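A rough sketch of the combined extension, building on Branko's ATTRIBUTE_VALUE definition above (the column names and sizes here are illustrative, not prescriptive):
CREATE TABLE mydatatype (
  datatype_id TINYINT PRIMARY KEY,
  code VARCHAR(16) NOT NULL,
  description VARCHAR(64)
);
CREATE TABLE ATTRIBUTE_VALUE (
  FORM_ID INT,
  ATTRIBUTE_NAME VARCHAR(50),
  RESPONSE_ID INT,
  datatype_id TINYINT REFERENCES mydatatype (datatype_id),
  value_tinyint TINYINT NULL,
  value_int INT NULL,
  value_varchar VARCHAR(50) NULL,
  value_datetime DATETIME NULL,
  PRIMARY KEY (FORM_ID, ATTRIBUTE_NAME, RESPONSE_ID)
);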
This extension should:
a. save you a good deal of casting when reading or writing your data
b. allow both reads and writes with some easily constructed dynamic SQL
Furthermore (and maybe this is out of scope), you may want to store the order in which these objects are created/saved, as well as conditional display based on button push/checkbox/radio button selection etc.
I won't go into detail here, since I'm not sure you need these things, but if you do, I'll check this every so often and respond with stuff.

How can you tell whether a table is autoincrement or not from the metadata of an SQLite table?

If you have several tables inside an SQLite database, how can you find out whether or not they have an auto-increment primary key?
For instance, I am already aware that you can get some info about the columns of a table by simply querying this: pragma table_info(tablename_in_here)
It would be much better to detect the auto-increment column dynamically rather than setting up each corresponding model inside the source code with a boolean value.
Edit:
Let me use this table as an example:
CREATE TABLE "test" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
"name" TEXT NOT NULL
)
and this is the result table after executing pragma table_info("test")
cid | name | type | notnull | dflt_value | pk
0 | id | INTEGER | 1 | null | 1
1 | name | TEXT | 1 | null | 0
As you can see there is no information whether the id column is autoincrement or not
Edit2:
I am looking for a solution that queries SQLite directly through a statement.
Special setups where the sqlite3 command in a terminal can be used to somehow parse out the required information are not acceptable. They do not work in situations where you are not allowed to execute commands in a terminal programmatically, like in an Android app.
Autoincrementing primary keys must be declared as INTEGER PRIMARY KEY or some equivalent, so you can use the table_info data to detect them, as sketched after the list below.
A column is an INTEGER PRIMARY KEY column if, in the PRAGMA table_info output,
the type is integer or INTEGER or any other case-insensitive variant; and
pk is set; and
pk is not set for any other column.
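If the SQLite version is 3.16.0 or newer, pragmas can also be read as table-valued functions, so the check above can be expressed as a single (illustrative) query that returns the name of the INTEGER PRIMARY KEY column, if there is one:
SELECT name
FROM pragma_table_info('test')
WHERE pk = 1
  AND lower(type) = 'integer'
  AND (SELECT count(*) FROM pragma_table_info('test') WHERE pk > 0) = 1;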
To check whether the column definition includes the AUTOINCREMENT keyword, you have to look directly into the sqlite_master table; SQLite has no other mechanism to access this information.
If this query returns a record, the AUTOINCREMENT keyword appears somewhere in the table definition (which might give a wrong result if that word only appears in a comment):
SELECT 1
FROM sqlite_master
WHERE type = 'table'
AND name = 'tablename_in_here'
AND sql LIKE '%AUTOINCREMENT%'
You can parse the output of .schema. That will give you the SQL commands as you used them to create your tables. If AUTOINCREMENT was declared, you will see it in the output. This has the advantage that it lists all your tables too.
