Does creating partitions lock the table? - MariaDB

I have a couple of big tables (from 60M rows to 2 billion rows) on which I need to create partitions. As they are used in the core of our platform, we are trying to find out whether the database will take any kind of lock on those tables when we start creating the partitions in the production environment.
We are on MariaDB 10.0.24 and 10.1.34.
ALTER TABLE data_history PARTITION BY RANGE ( period ) (
PARTITION past VALUES LESS THAN (201801),
PARTITION p2018 VALUES LESS THAN (201901),
PARTITION p2019 VALUES LESS THAN (202001),
PARTITION p2020 VALUES LESS THAN (202101),
PARTITION future VALUES LESS THAN MAXVALUE
);
The period column is an integer holding year and month in YYYYMM format.

When adding PARTITIONing to a table, the entire table is copied over. Hence the lock.
When adding a PARTITION to an already-partitioned table, it depends on the actual command issued. All (?) cases will LOCK, but the length of the lock varies. Please provide the actual command.
Why partition? Partitioning does not intrinsically lead to any performance improvements. See this for discussion of the few cases where partitioning is beneficial: http://mysql.rjweb.org/doc.php/partitionmaint
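For illustration, here is the kind of partition-maintenance command meant above, wrapped in a minimal Python sketch (the pymysql driver and the connection details are assumptions, not something from the question). It splits the catch-all future partition to add a new year; unlike the initial PARTITION BY RANGE, which copies the whole table, REORGANIZE PARTITION rebuilds only the partitions it names, so the lock is correspondingly shorter.
import pymysql
conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="platform")
try:
    with conn.cursor() as cur:
        # Rebuilds only the named partitions, not the whole table, so the
        # lock lasts only as long as the (presumably small) future
        # partition takes to rewrite.
        cur.execute("""
            ALTER TABLE data_history
            REORGANIZE PARTITION future INTO (
                PARTITION p2021 VALUES LESS THAN (202201),
                PARTITION future VALUES LESS THAN MAXVALUE
            )
        """)
finally:
    conn.close()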

Related

DynamoDB GSI table replication when partition key is same as the main table

Case 1) When we create a GSI with a partition key different from the main table's partition key, DynamoDB replicates the data into another table under the hood. That much is understood.
Case 2) What if I create a GSI with the same partition key as the main table's PK but just with a different sort key? Will it replicate the data the same way as in Case 1? This situation sounds similar to an LSI because they also share the partition key with the main table. If I created an LSI instead, would it save me any data replication and hence the cost associated with it?
Yes, it replicates the same as Case 1. In general people should use GSIs unless they absolutely require LSIs.
Pros of an LSI:
Enables strongly consistent reads out of the index
Cons of an LSI:
Cannot be added or deleted after table creation
Prevents an item collection (items having the same PK) from growing beyond 10 GB (because to maintain strong reads the item collection has to be co-located)
Prevents adaptive capacity from isolating hot items in the item collection across different partitions (again, due to the need to be co-located)
Increases the likelihood of a hot partition because the base table write and LSI writes always go to the same partition, limiting write throughput to that partition (whereas a GSI has its own write capacity)
It's not actually true to say LSIs don't cost extra. They still consume write capacity, just out of the base table's allotment.
Any GSI regardless of the key is a separate table you pay extra for.
An LSI doesn't cost quite as much extra as a GSI, especially if using a provisioned table. Additionally, an LSI has strongly consistent reads available just like the base table; GSIs only offer eventually consistent reads.
However, the downside to using an LSI instead of a GSI is that a table with an LSI is limited to a partition size of 10 GB.
In other words, if you try to add more than 10 GB of data under a single partition (aka hash) key value in a table that has any LSIs, the write will fail.
If there are no LSIs, then it will succeed.
Item collection size limit
The maximum size of any item collection for a table which has one or more local secondary indexes is 10 GB. This does not apply to item collections in tables without local secondary indexes, and also does not apply to item collections in global secondary indexes. Only tables that have one or more local secondary indexes are affected.
So depending on your data, it might behoove you to pay for the GSI even if an LSI would work instead.
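Since an LSI cannot be added after the fact, Case 2 has to be decided at table creation. Below is a minimal boto3 sketch (the table and attribute names are hypothetical) of a table whose index shares the partition key but uses a different sort key, declared as an LSI:
import boto3
client = boto3.client("dynamodb")
client.create_table(
    TableName="Orders",  # hypothetical
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
        {"AttributeName": "order_total", "AttributeType": "N"},
    ],
    KeySchema=[  # base table: partition key customer_id, sort key order_date
        {"AttributeName": "customer_id", "KeyType": "HASH"},
        {"AttributeName": "order_date", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "by_total",
        "KeySchema": [  # same partition key, different sort key
            {"AttributeName": "customer_id", "KeyType": "HASH"},
            {"AttributeName": "order_total", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
Moving the same by_total KeySchema into GlobalSecondaryIndexes instead would lift the 10 GB item-collection cap, at the cost of the separately billed, eventually consistent replica described above.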

How to query a DynamoDB global secondary index across multiple shards?

This article (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html) talks about a technique for sharding global secondary index values across multiple partitions, by introducing a random integer as the partition key.
That makes sense to me, but the article does not clearly explain how to then query that index. Let's say I'm using a random integer from 1-10 as the partition key, and a number as the sort key, and I want to fetch the 3 records with the highest sort key value (from all partitions).
Would I need to do 10 separate queries, sorting each one, with a limit of 3 items, then do an in-memory sort of the resulting 30 items and pick the first 3? That seems needlessly complicated, and not very efficient for the client.
Is there some way to do a single DynamoDB operation that queries all 10 partitions, does the sorting, and just returns the 3 records with the highest value?
Would I need to do 10 separate queries
Yes. This is called a scatter read in the Dynamo docs...
Normally the client would do so with multiple threads...so while it adds complexity, efficiency is usually good.
Why the limit 3? That requirement seems to be the bigger cause of inefficiency.
Is there some way to do a single DynamoDB operation that queries all 10 partitions, does the sorting, and just returns the 3 records with the highest value?
The only way to query all partitions is with a full table Scan. But that doesn't provide sorting & ordering. You'd still need to do it in your app. The scan would be a lot less efficient than the scatter read.
If this is a "Top 3 sellers" type list...I believe the recommended practice is to (periodically) calculate & store the results rather than constantly deriving them. Take a look here: Using Global Secondary Indexes for Materialized Aggregation Queries
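For concreteness, here is a minimal sketch of the scatter read (assuming boto3 and a hypothetical GSI by_score whose partition key shard is the random 1-10 integer and whose sort key is score): query every shard in parallel for its top 3, then merge and keep the overall top 3.
from concurrent.futures import ThreadPoolExecutor
import boto3
from boto3.dynamodb.conditions import Key
table = boto3.resource("dynamodb").Table("Leaderboard")  # hypothetical
def top3_of_shard(shard):
    # ScanIndexForward=False walks the sort key descending, so Limit=3
    # returns the three highest scores within this one shard.
    resp = table.query(
        IndexName="by_score",
        KeyConditionExpression=Key("shard").eq(shard),
        ScanIndexForward=False,
        Limit=3,
    )
    return resp["Items"]
with ThreadPoolExecutor(max_workers=10) as pool:
    candidates = [item for items in pool.map(top3_of_shard, range(1, 11))
                  for item in items]
# At most 30 candidates; the global top 3 must be among them.
top3 = sorted(candidates, key=lambda i: i["score"], reverse=True)[:3]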

DynamoDB Insert with smallest range key when sparse range keys present for a partition key

For a specific partition key, I need to put an item in the table with smallest range key possible.
eg. If my table already has these four items
(partition key, range key) - (1, 1) (1, 2) (1, 5) (2, 1)
and I want to put an item for partition key 1, then I'll put (1, 3) in the table.
One way I can do this is to query all the range keys for my required partition key, but I guess that would be a very expensive operation.
Is there any other better way?
Short Answer
Without knowing anything else about the data, there is no better way to find the first available unused sort key.
Longer Answer
Note that this general question (finding the least unused integer index) has been asked many times for SQL databases. Although SQL offers various ways to find unused keys (left outer joins, correlated subqueries, stored procedures), the performance is never better than scanning the index space as you have proposed.
To find keys faster, you would need to know in advance the available sort keys in the partition. That would likely involve creating and maintaining some sort of free space list (perhaps in a separate Dynamo table). Creating the free list for any partition would require a scan of all its sort keys, but subsequent insert operations could consult the free list to more efficiently find the next available key. Of course, you would have to keep the free list up to date for any insertions & deletions.
Finally, take a moment to consider how your solution will be affected by concurrency and DynamoDB's optimistic locking strategy. If multiple processes are writing to the same partition at the same time, the sort key value that is available when you find it might be unavailable by the time you go to do the insertion.
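To make that concrete, here is a minimal boto3 sketch (the Items table and its pk/sk names are hypothetical) of the scan-for-a-gap approach, paired with a conditional write to guard against exactly the race described above:
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError
table = boto3.resource("dynamodb").Table("Items")  # hypothetical
def first_free_sort_key(pk):
    # Walk the partition's sort keys in ascending order (Query returns
    # them sorted) and return the first missing positive integer,
    # e.g. 3 for [1, 2, 5]. Pagination is omitted for brevity.
    resp = table.query(KeyConditionExpression=Key("pk").eq(pk),
                       ProjectionExpression="sk")
    expected = 1
    for item in resp["Items"]:
        if int(item["sk"]) != expected:
            break
        expected += 1
    return expected
def put_with_smallest_sk(pk, attrs):
    while True:
        sk = first_free_sort_key(pk)
        try:
            # The condition makes the put fail rather than silently
            # overwrite if a concurrent writer claimed sk first.
            table.put_item(Item={"pk": pk, "sk": sk, **attrs},
                           ConditionExpression="attribute_not_exists(sk)")
            return sk
        except ClientError as err:
            if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # Lost the race: re-scan the partition and try again.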

How can I improve performance while altering a large MySQL table?

I have 600 million records in a table and I am not able to add a column to it, as every time I try, the operation times out.
Suppose in your MySQL database you have a giant table with 600 million rows. Any schema operation on it, such as adding a unique key, altering a column, or even adding one more column, is a very cumbersome process which takes hours and can sometimes end in a server timeout. To overcome that, you have to come up with a very good migration plan; I am jotting one down below.
1) Suppose there is a table Orig_X to which I have to add a new column colNew with a default value of 0.
2) A dummy table Dummy_X is created which is a replica of Orig_X, except with the new column colNew.
3) Data is inserted from Orig_X into Dummy_X with the following settings (see the chunked sketch after these steps).
4) Autocommit is set to zero, so that data is not committed after each insert statement, which would hinder performance.
5) Binary logging is turned off, so that nothing is written to the binary logs during the copy.
6) After the data is inserted, both features are turned back on.
SET AUTOCOMMIT = 0;
SET sql_log_bin = 0;
INSERT INTO Dummy_X (col1, col2, col3, colNew)
SELECT col1, col2, col3, 0 FROM Orig_X;
SET sql_log_bin = 1;
SET AUTOCOMMIT = 1;
7) Now the primary key can be created, with the newly inserted column becoming part of it.
8) All the unique keys can now be created.
9) We can check the status of the server by issuing the following command
SHOW MASTER STATUS
10) It’s also helpful to issue FLUSH LOGS so MySQL will clear the old logs.
11) To boost the performance of repeated queries of a similar type, such as the insert statement above, the query cache variable should be on.
SHOW VARIABLES LIKE 'have_query_cache';
query_cache_type = 1  -- set in the server configuration (my.cnf)
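Here is a minimal Python sketch of steps 3) through 6) (assuming the pymysql driver and that Orig_X has an integer primary key id; neither is given in the question), copying in chunks so that no single transaction spans all 600 million rows:
import pymysql
conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="platform")
CHUNK = 100_000
with conn.cursor() as cur:
    cur.execute("SET autocommit = 0")   # step 4
    cur.execute("SET sql_log_bin = 0")  # step 5 (needs SUPER privilege)
    cur.execute("SELECT COALESCE(MAX(id), 0) FROM Orig_X")
    max_id = cur.fetchone()[0]
    for lo in range(0, max_id + 1, CHUNK):
        cur.execute(
            """INSERT INTO Dummy_X (col1, col2, col3, colNew)
               SELECT col1, col2, col3, 0
               FROM Orig_X WHERE id >= %s AND id < %s""",
            (lo, lo + CHUNK),
        )
        conn.commit()  # one bounded commit per chunk
    cur.execute("SET sql_log_bin = 1")  # step 6
    cur.execute("SET autocommit = 1")
conn.close()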
Those were the steps of the migration strategy for the large table; below I am writing some steps to improve the performance of the database/queries.
1) Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes, as these disable change buffering. Don't use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.
2) If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them all at once after the data is loaded, InnoDB is able to apply a pre-sort and bulk-load process which is both faster and typically results in more compact indexes.
3) More memory can actually help in performance optimization. If SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more memory (assuming you have sized innodb_buffer_pool_size correctly on your server).
4) Normally your database table gets re-indexed after every insert. That's some heavy lifting for your database, but when your queries are wrapped inside a transaction, the table does not get re-indexed until the entire bulk is processed, saving a lot of work (see the sketch after this list).
5) Most MySQL servers have query caching enabled. It's one of the most effective methods of improving performance that is quietly handled by the database engine. When the same query is executed multiple times, the result is fetched from the cache, which is quite fast.
6) Using the EXPLAIN keyword can give you insight on what MySQL is doing to execute your query. This can help you spot the bottlenecks and other problems with your query or table structures. The results of an EXPLAIN query will show you which indexes are being utilized, how the table is being scanned and sorted etc...
7) If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
8) In every table, have an id column that is the PRIMARY KEY, AUTO_INCREMENT, and one of the flavors of INT, preferably UNSIGNED since the value cannot be negative.
9) Even if you have a users table with a unique username field, do not make that your primary key. VARCHAR fields as primary keys are slower, and you will have a better structure in your code by referring to all users by their ids internally.
10) Normally when you perform a query from a script, it will wait for the execution of that query to finish before it can continue. You can change that by using unbuffered queries. This saves a considerable amount of memory with SQL queries that produce large result sets, and you can start working on the result set immediately after the first row has been retrieved as you don't have to wait until the complete SQL query has been performed.
11) With database engines, disk is perhaps the most significant bottleneck. Keeping things smaller and more compact is usually helpful in terms of performance, to reduce the amount of disk transfer.
12) The two main storage engines in MySQL are MyISAM and InnoDB. Each has its own pros and cons. MyISAM is good for read-heavy applications, but it doesn't scale very well when there are a lot of writes: even if you are updating one field of one row, the whole table gets locked, and no other process can even read from it until that query is finished. MyISAM is, however, very fast at calculating SELECT COUNT(*) types of queries. InnoDB tends to be a more complicated storage engine and can be slower than MyISAM for most small applications, but it supports row-based locking, which scales better. It also supports some more advanced features such as transactions.
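As a concrete illustration of tip 4, here is a minimal sketch (assuming pymysql and a hypothetical events table) of a bulk insert wrapped in a single transaction:
import pymysql
conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="platform")
rows = [("click", 1), ("view", 2), ("click", 3)]  # hypothetical data
try:
    with conn.cursor() as cur:
        # pymysql does not autocommit by default, so the whole batch
        # lands in one transaction; nothing is flushed until commit().
        cur.executemany(
            "INSERT INTO events (kind, user_id) VALUES (%s, %s)", rows)
    conn.commit()   # the single commit for the whole batch
except Exception:
    conn.rollback() # all-or-nothing: a failure undoes every row
    raise
finally:
    conn.close()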

Are DynamoDB UUID hash keys better than sequentially generated ones?

I think I understand the concept of not having hot hashKeys so that you use all the partitions in provisioning throughput. But do UUID hashKeys do a better job of distributing across the partitions than numerically sequenced ones? In both cases is a hashcode generated from the key and that value used to assign to a partition? If so, how do the hashcodes from two strings like: "100444" and "100445" differ? Are they close?
"100444" and "100445" are not any more likely to be in the same partition than a completely different number, like "12345" for example. Think of a DynamoDB table as a big hash table, where the hash key of the table is the key into the hash table. The underlying hash table is organized by the hash of the key, not by the key itself. You'll find that numbers and strings (UUIDs) both distribute fine in DynamoDB in terms of their distribution across partitions.
UUIDs are useful in DynamoDB because sequential numbers are difficult to generate in a scalable way for primary keys. Random numbers work well for primary keys, but sequential values are hard to generate without gaps and in a way that scales to the level of throughput that you can provision in a DynamoDB table. When you insert new items into a DynamoDB table, you can use conditional writes to ensure an item doesn't already exist with that primary key value.
(Note: this question is also cross-posted in this AWS Forums post and discussed there as well).
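For concreteness, a minimal boto3 sketch of that pattern (the Users table and its id key are hypothetical): a random UUID as the partition key, plus a conditional write so an existing item can never be silently overwritten.
import uuid
import boto3
from botocore.exceptions import ClientError
table = boto3.resource("dynamodb").Table("Users")  # hypothetical
def create_user(name):
    user_id = str(uuid.uuid4())  # uniformly distributed across partitions
    try:
        table.put_item(
            Item={"id": user_id, "name": name},
            ConditionExpression="attribute_not_exists(id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return create_user(name)  # collision (vanishingly rare): retry
        raise
    return user_id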
