why no index and/or PK on table DATABASECHANGELOG? - mariadb

Quote from MariaDB Galera Cluster - Known Limitations
All tables should have a primary key (multi-column primary keys are
supported). DELETE operations are unsupported on tables without a
primary key. Also, rows in tables without a primary key may appear in
a different order on different nodes.
Galera requires that every table have a PK, or at least an index. This limitation exists mainly because of replication (the wsrep plugin).
We operate a Galera/MariaDB cluster and I see customers with a DATABASECHANGELOG table that has no index and no PK. I guess this table is append-only (no update or delete operations).
I don't know Liquibase, which is why I'm asking for the reason behind the missing index and/or PK here. Should I open a bug report, or am I misunderstanding this use case?
+----------------------------------------------------------------------------------------+------------+------------+-------------+---------------------+-----------------------+
| schema | table_rows | non_unique | cardinality | medium distribution | replication row reads |
+----------------------------------------------------------------------------------------+------------+------------+-------------+---------------------+-----------------------+
(...)
| xxx.DATABASECHANGELOG | 571 | NULL | NULL | 571.0000 | 326041.0000 |
| xxxx.DATABASECHANGELOG | 491 | NULL | NULL | 491.0000 | 241081.0000 |
| xxxxx.DATABASECHANGELOG | 433 | NULL | NULL | 433.0000 | 187489.0000 |
+----------------------------------------------------------------------------------------+------------+------------+-------------+---------------------+-----------------------+
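For reference, a query along these lines is how such tables can be spotted (a sketch against information_schema; adjust the excluded schemas as needed):

-- List base tables that have no PRIMARY KEY constraint, which is what
-- Galera complains about. Schema filter is illustrative.
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON c.table_schema    = t.table_schema
      AND c.table_name      = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('information_schema', 'performance_schema', 'mysql', 'sys')
  AND c.constraint_name IS NULL;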

Check out this Jira ticket
Liquibase was changed to not create a primary key in databasechangelog table because it introduced problems with key sizes and wasn't really necessary. I didn't put in a check and drop for an existing primary key yet. It should be dropped but doesn't cause a problem unless you are hitting an edge case where you have very long id, author and/or file paths.
A possible workaround is also described there:
A simple workaround could be to add a change set with a not
primaryKeyExists pre-condition and addPrimaryKey change. A more
involved workaround could be to create a plugin which overrides the
CreateDatabaseChangeLogTableGenerator and/or
StandardChangeLogHistoryService which implement PrioritizedService.
Below is an example of the simple workaround. I've optimized the
indexes to reduce table scans, sorting and bookmark lookups on SQL
Server, but it's probably equally applicable to Oracle. I was not
really concerned about the maximum key length of 900 bytes on SQL
Server being exceeded.
<?xml version="1.0" encoding="UTF-8" ?>
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="
        http://www.liquibase.org/xml/ns/dbchangelog http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.4.xsd">

    <property name="liquibaseCatalogName" value=""/>
    <property name="liquibaseSchemaName" value="${database.liquibaseSchemaName}"/>
    <property name="databaseChangeLogTableName" value="${database.databaseChangeLogTableName}"/>
    <property name="liquibaseTablespaceName" value=""/>

    <changeSet id="1" author="your-name">
        <preConditions onFail="MARK_RAN">
            <primaryKeyExists
                catalogName="${liquibaseCatalogName}"
                schemaName="${liquibaseSchemaName}"
                tableName="${databaseChangeLogTableName}"
                primaryKeyName="PK_${databaseChangeLogTableName}"/>
        </preConditions>
        <dropPrimaryKey
            catalogName="${liquibaseCatalogName}"
            schemaName="${liquibaseSchemaName}"
            tableName="${databaseChangeLogTableName}"
            constraintName="PK_${databaseChangeLogTableName}"/>
    </changeSet>

    <changeSet id="2" author="your-name">
        <createIndex
            catalogName="${liquibaseCatalogName}"
            schemaName="${liquibaseSchemaName}"
            tableName="${databaseChangeLogTableName}"
            indexName="IX_${databaseChangeLogTableName}_DATEEXECUTED_ORDEREXECUTED"
            tablespace="${liquibaseTablespaceName}"
            clustered="true">
            <column name="DATEEXECUTED"/>
            <column name="ORDEREXECUTED"/>
        </createIndex>
    </changeSet>

    <changeSet id="3" author="your-name">
        <addPrimaryKey
            catalogName="${liquibaseCatalogName}"
            schemaName="${liquibaseSchemaName}"
            tableName="${databaseChangeLogTableName}"
            constraintName="PK_${databaseChangeLogTableName}"
            tablespace="${liquibaseTablespaceName}"
            clustered="false"
            columnNames="ID,AUTHOR,FILENAME"/>
    </changeSet>

</databaseChangeLog>
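If you only need to satisfy Galera on an existing installation and can run DDL directly on the cluster, a one-off statement along these lines should also work. This is a hedged sketch assuming the standard DATABASECHANGELOG column definitions (ID, AUTHOR, FILENAME as VARCHAR(255)); note the key-size concern from the Jira comment:

-- Sketch: add back the composite primary key Liquibase used to create.
-- With utf8mb4 and three VARCHAR(255) columns the key is close to InnoDB's
-- 3072-byte index limit, so verify charset/row format and key size first.
ALTER TABLE DATABASECHANGELOG
  ADD CONSTRAINT PK_DATABASECHANGELOG PRIMARY KEY (ID, AUTHOR, FILENAME);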

Related

DynamoDB Global Secondary Index "Batch" Retrieval

I've seen older posts around this but I'm hoping to bring this topic up again. I have a table in DynamoDB that has a UUID for the primary key, and I created a global secondary index (GSI) for a more business-friendly key. For example:
| account_id  | email           | first_name | last_name |
|-------------|-----------------|------------|-----------|
| 4f9cb231... | linda@gmail.com | Linda      | James     |
| a0302e59... | bruce@gmail.com | Bruce      | Thomas    |
| 3e0c1dde... | harry@gmail.com | Harry      | Styles    |
If account_id is my primary key and email is my GSI key, how do I query the table to get accounts with email in ('linda@gmail.com', 'harry@gmail.com')? I looked at the IN conditional expression but it doesn't appear to work with a GSI. I'm using the Go SDK v2 library but will take any guidance. Thanks.
Short answer, you can't.
DDB is designed to return a single item, via GetItem(), or a set of related items, via Query(). Related meaning that you're using a composite primary key (hash key & sort key) and the related items all have the same hash key (aka partition key).
Another way to think of it, you can't Query() a DDB Table/index. You can only Query() a specific partition in a table or index.
Scan() is the only operation that works across partitions in one shot. But scanning is very inefficient and costly since it reads the entire table every time.
You'll need to issue a GetItem() (or, since email is only the GSI key rather than the table's primary key, a Query() against the index) for every email you want returned.
Luckily, DDB now offers BatchGetItem(), which will allow you to send multiple GetItem() requests, up to 100, in a single call. It saves a little bit of network time and automatically runs the requests in parallel, but otherwise it is little different from what your application could do itself directly with GetItem(). Make no mistake, BatchGetItem() is making individual GetItem() requests behind the scenes. In fact, the requests in a BatchGetItem() don't even have to be against the same table. The cost for each request in a batch will be the same as if you'd used GetItem() directly.
One difference to make note of: BatchGetItem() can only return 16 MB of data. So if your DDB items are large, you may not get as many items returned as you requested.
For example, if you ask to retrieve 100 items, but each individual
item is 300 KB in size, the system returns 52 items (so as not to
exceed the 16 MB limit). It also returns an appropriate
UnprocessedKeys value so you can get the next page of results. If
desired, your application can include its own logic to assemble the
pages of results into one dataset.
Because you have a GSI with a PK of email (from what I understand), you can use a PartiQL statement to get your batch of emails back. The API is called ExecuteStatement and you use a SQL-like syntax:
SELECT * FROM "mytable"."myindex" WHERE email IN ['email@email.com','email1@email.com']

Properly Indexing table per time on MariaDB

I believe this is only me not realizing something obvious.
I currently have a table of positions for a car tracking software.
The current structure is as follows:
CREATE TABLE `positions` (
`id` char(36) NOT NULL,
`vehicleId` char(36) DEFAULT NULL,
`time` datetime NOT NULL,
`date` date NOT NULL, -- date being time without the hours, minutes and seconds
`lat` decimal(10,7) NOT NULL,
`lng` decimal(10,7) NOT NULL,
`speed` int(11) NOT NULL,
`attributes` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`attributes`)),
`created_at` datetime(6) NOT NULL DEFAULT current_timestamp(6),
`updated_at` timestamp(6) NULL DEFAULT current_timestamp(6) ON UPDATE current_timestamp(6),
PRIMARY KEY (`id`),
KEY `IDX_0605352b480db5b3769797b9e8` (`time`),
KEY `IDX_de42da506f977dddd80bc8e3ac` (`vehicleId`,`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
This table has positions from only one month, as I have a cron process that executes once a month to remove all positions that are not from the current month.
Yet this table grew to around a million entries, and queries on it became extremely slow.
I am trying to fetch all positions from a specific date and from a specific vehicle:
SELECT * FROM positions WHERE vehicleId='id here' AND date='date here';
But for some reason it is extremely slow.
Server is a Xeon E5-1630 v4 with 4 GB RAM and 160 GB SSD, Running Fedora 34(5.13.14-200.fc34.x86_64).
The server is running MariaDB server(10.5.12-MariaDB), Redis, Node.JS and Caddy
EDIT: Answering comments,
EXPLAIN SELECT * FROM positions WHERE vehicleId='5d634444-ed56-49b2-9628-ba51182391c1' AND date='2021-09-23';
+------+-------------+-----------+------+--------------------------------+--------------------------------+---------+-------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+------+--------------------------------+--------------------------------+---------+-------------+------+-----------------------+
| 1 | SIMPLE | positions | ref | IDX_de42da506f977dddd80bc8e3ac | IDX_de42da506f977dddd80bc8e3ac | 148 | const,const | 268 | Using index condition |
+------+-------------+-----------+------+--------------------------------+--------------------------------+---------+-------------+------+-----------------------+
innodb_buffer_pool_size is currently at 2GB(half of my server's memory)
It looks like the commonly used data is exceeding the 2G InnoDB buffer pool size. Options to investigate before getting more RAM and increasing it:
As vehicleId appears to be a UUID, utf8mb4 is rather wasteful on size for this. It could be converted to ascii, latin1 or something else with 1 byte per char:
ALTER TABLE positions MODIFY vehicleId char(36) CHARACTER SET ascii DEFAULT NULL;
Ensure you change the vehicleId type in other tables too, otherwise joins requiring a character set conversion get rather expensive (as a recent user discovered).
Note also that in the 10.7.0 preview, UUID is a new datatype.
restrict retrieval
If you don't need every column, you could restrict the retrieval to just the fields needed. If the select list is reduced to just the indexed columns, a lookup of the other fields isn't needed. If attributes isn't needed, this also avoids potential off-page lookups.
It looks like maybe vehicleId, time could be a composite primary key.
If this is the most common query, and the primary key isn't used elsewhere, this would speed up retrieval of the columns that aren't in a secondary index. It would involve changing the query to use time ranges to make the most effective use of the key; a sketch follows.
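A hedged sketch of those two suggestions. Keeping id in the key as a uniqueness tie-breaker is my assumption, since (vehicleId, time) alone may not be unique; verify against your data first:

-- Assumption: vehicleId must be NOT NULL for a primary key (it is currently
-- nullable), and id is appended in case (vehicleId, time) has duplicates.
ALTER TABLE positions
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (vehicleId, `time`, id);

-- Query rewritten as a time range over the new clustered key,
-- selecting only the columns actually needed.
SELECT `time`, lat, lng, speed
FROM positions
WHERE vehicleId = '5d634444-ed56-49b2-9628-ba51182391c1'
  AND `time` >= '2021-09-23'
  AND `time` <  '2021-09-24';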
Otherwise, look closer at RAM, especially ensure that MariaDB isn't swapping during query retrieval. Having buffer pool memory ending up in swap isn't useful.

A rudimentary way to store comments on a proposal webpage using SQLite

I am a software engineer, but I am very new to databases and I am trying to hack up a tool to show some demo.
I have an Apache server which serves a simple web page full of tables. Each row in the table has a proposal id and a link to a web page where the proposal is explained. So just two columns.
----------------------
| id | proposal      |
|---------------------
| 1  | foo.html      |
| 2  | bar.html      |
----------------------
Now, I want to add a third column titled Comments where a user can leave comments.
------------------------------------------------
| id | proposal | Comments                      |
|-----------------------------------------------
| 1  | foo.html | x: great idea !               |
|    |          | y: +1                         |
| 2  | bar.html | z: not for this release       |
------------------------------------------------
I just want to quickly hack up something to show this as a demo and get feedback. I am planning to use SQLite to create a table per id and store the userid and comments in that table. People may add comments at the same time. I am planning to use a lock to perform operations on the SQLite database. I am not worried about scaling; I just want to show it and get feedback. Are there any major flaws in this implementation?
There are similar questions, but I am looking for the simplest possible implementation.
Table per ID; why would you want to do that? If you get a large number of proposals, the number of tables can get out of hand very quickly. You just need to keep an id column in the table to keep track of things and keep the number of tables at a sane figure.
The other drawback of using a table for each proposal is that you will not be able to use prepared statements for those, because table names cannot be bound as a parameter.
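A minimal sketch of that single-table approach in SQLite (all table and column names here are illustrative):

-- One proposals table and one comments table; comments reference the
-- proposal by id instead of living in a separate table per proposal.
CREATE TABLE proposals (
    id       INTEGER PRIMARY KEY,
    proposal TEXT NOT NULL
);

CREATE TABLE comments (
    id          INTEGER PRIMARY KEY,
    proposal_id INTEGER NOT NULL REFERENCES proposals(id),
    userid      TEXT NOT NULL,
    comment     TEXT NOT NULL
);

-- User 'y' adds "+1" to proposal 1 (foo.html):
INSERT INTO comments (proposal_id, userid, comment) VALUES (1, 'y', '+1');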
Assuming the SQLite table name is a:
Add column
alter table a add column Comments text;
Insert comment
insert into a values (4, 'hello.html', 'New Comment');
You need to provide values for the other two columns along with the new comment.

Analyze a scenario performance?

I want to design something like a dynamic form in which an admin defines each form's fields.
I designed 3 tables: a mainform table for the shared properties, then a formfield table which has mainformID as a foreign key and defines each form's fields,
e.g.:
AutoID | FormID | FieldName
_____________________________
100    | Form1  | weight
101    | Form1  | height
102    | Form1  | color
103    | Form2  | Size
104    | Form2  | Type
....
and lastly a formvalues table like below:
FormFieldID | Value | UniqueResponseID
___________________________________________
100         | 50px  | 200
101         | 60px  | 200
102         | Red   | 200
100         | 30px  | 201
101         | 20px  | 201
102         | Black | 201
103         | 20x10 | 201
104         | Y     | 201
....
For each form I have to join these 3 tables to fetch all fields and values. I wonder if this is the only way to design such a scenario? Does it decrease SQL performance? Or is there a faster and better way?
This is a form of EAV, and I'm gonna assume you absolutely have to do it instead of the "static" design.
does it decrease sql performance?
Yes, getting a bunch of rows (under EAV) is always going to be slower than getting just one (under the static design).
or is there any fast and better way?
Not from the logical standpoint, but there are significant optimizations (for query performance at least) that can be done at the physical level. Specifically, you can carefully design your keys to minimize the I/O (by putting related data close together) and even eliminate the JOIN itself.
For example:
This model migrates keys through FOREIGN KEY hierarchy all the way down to the ATTRIBUTE_VALUE table. The resulting natural composite key in ATTRIBUTE_VALUE table enables us to:
Get all attributes [1] of a given form by a single index range scan + table heap access on the ATTRIBUTE_VALUE table, and without doing any JOINs at all. In addition to that, you can cluster [2] it, eliminating the table heap access and leaving you with only the index range scan [3].
If you need to only get the data for a specific response, change the order of the fields in the composite key, so the RESPONSE_ID is at the leading edge.
If you need both "by form" and "by response" queries, you'll need both indexes, at which point I'd recommend the secondary index to also cover [4] the VALUE field.
For example:
-- Since we haven't used NONCLUSTERED clause, this is a B-tree
-- that covers all fields. Table heap doesn't exist.
CREATE TABLE ATTRIBUTE_VALUE (
    FORM_ID        INT,
    ATTRIBUTE_NAME VARCHAR(50),
    RESPONSE_ID    INT,
    VALUE          VARCHAR(50),
    PRIMARY KEY (FORM_ID, ATTRIBUTE_NAME, RESPONSE_ID)
    -- FOREIGN KEYs omitted for brevity.
);

-- We have included VALUE, so this B-tree covers all fields as well.
CREATE UNIQUE INDEX ATTRIBUTE_VALUE_IE1 ON
    ATTRIBUTE_VALUE (RESPONSE_ID, FORM_ID, ATTRIBUTE_NAME)
    INCLUDE (VALUE);
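For illustration, queries that the two B-trees above should satisfy without any join (a sketch; the form and response ids stand in for the sample values from the question):

-- All attributes of one form: a single range scan on the clustered primary key.
SELECT ATTRIBUTE_NAME, RESPONSE_ID, VALUE
FROM ATTRIBUTE_VALUE
WHERE FORM_ID = 1;          -- 1 standing in for 'Form1'

-- All values of one response: a range scan on the covering secondary index.
SELECT FORM_ID, ATTRIBUTE_NAME, VALUE
FROM ATTRIBUTE_VALUE
WHERE RESPONSE_ID = 200;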
[1] Or a specific attribute, or a specific response for a specific attribute.
[2] MS SQL Server clusters all tables by default, unless you specify the NONCLUSTERED clause.
[3] Friendliness to clustering and elimination of JOINs are some of the main strengths of natural keys (as opposed to surrogate keys). But they also make tables "fatter" and don't isolate from ON UPDATE CASCADE. I believe the pros outweigh the cons in this particular case. For more info on natural vs. surrogate keys, look here.
[4] Fortunately, MS SQL Server supports including fields in an index solely for covering purposes (as opposed to actually searching through the index). This makes the index leaner than a "normal" index on the same fields.
I like Branko's approach, and it is quite similar to metadata models I have created in the past, so this post is by way of extension to his. You may want to add a datatype table, which can work both for native types (int, varchar, bit, datetime, etc.) and your own definitions (although I don't see the necessity off the cuff).
Thence, Branko's "value" column becomes:
value_tinyint tinyint
value_int int
value_varchar varchar(xx)
etc.
with a datatype_id (probably tinyint) as a foreign key into the "mydatatype" table.
[excuse the lack of pretty ER diagrams like BD's]
mydatatype
datatype_id tinyint
code varchar(16)
description varchar(64) -- for reference purposes
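A rough DDL sketch of this extension, following the table layout above (sizes and NULLability are guesses on my part):

-- Lookup table for the supported data types.
CREATE TABLE mydatatype (
    datatype_id TINYINT     NOT NULL PRIMARY KEY,
    code        VARCHAR(16) NOT NULL,
    description VARCHAR(64) NULL    -- for reference purposes
);

-- The single VALUE column is replaced by one column per native type,
-- with datatype_id indicating which column is populated for a row.
-- If ATTRIBUTE_VALUE already has rows, give datatype_id a DEFAULT (or make it
-- NULLable) first; the original VALUE column and the index that INCLUDEs it
-- would be adjusted once the data is migrated.
ALTER TABLE ATTRIBUTE_VALUE
    ADD datatype_id   TINYINT     NOT NULL REFERENCES mydatatype (datatype_id),
        value_tinyint TINYINT     NULL,
        value_int     INT         NULL,
        value_varchar VARCHAR(50) NULL;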
This extension should:
a. save you a good deal of casting when reading or writing your data
b. allow both reads and writes with some easily constructed dynamic SQL
Furthermore (and maybe this is out of scope), you may want to store the order in which these objects are created/saved, as well as conditional display based on button push/checkbox/radio button selection etc.
I won't go into detail here, since I'm not sure you need these things, but if you do, I'll check this every so often and respond with stuff.

How to handle additional columns in join tables when using Symfony?

Let's assume I have two entities in my Symfony2 bundle, User and Group, associated by a many-to-many relationship.
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
| USER | | USER_GROUP_REL | | GROUP |
├────────────────┤ ├────────────────┤ ├────────────────┤
| id# ├---------┤ user_id# | ┌----┤ id# |
| username | | group_id# ├----┘ | groupname |
| email | | created_date | | |
└────────────────┘ └────────────────┘ └────────────────┘
What would be a good practice or a good approach to add additional columns to the join table, like a created date which represents the date when User joined Group?
I know that I could use the QueryBuilder to write an INSERT statement.
But so far I have not seen any INSERT example using the QueryBuilder or native SQL, which makes me believe that ORM/Doctrine tries to avoid direct INSERT statements (e.g. for security reasons). Plus, as far as I have understood Symfony and Doctrine, I would be taken aback if such a common requirement weren't covered by the framework.
You want to set a property of the relation. This is how it's done in doctrine:
doctrine 2 many to many (Products - Categories)
I answered that question with a use case (like yours).
This is an additional question / answer which considers the benefits and use cases: Doctrine 2 : Best way to manage many-to-many associations
