How can I improve performance while altering a large mysql table? - bigdata

I have 600 million records in a table and I am not able to add a column to it, because every time I try, the operation times out.

Suppose your MySQL database has a giant table with 600 million rows. Any schema operation on it, such as adding a unique key, altering a column, or even adding one more column, is a cumbersome process that takes hours and sometimes ends in a server timeout. To overcome that, you have to come up with a good migration plan, one of which I am jotting down below.
1) Suppose there is a table Orig_X to which I have to add a new column colNew with a default value of 0.
2) A dummy table Dummy_X is created as a replica of Orig_X, except that it also has the new column colNew (a sketch of this and of steps 7 and 8 appears after these steps).
3) Data is inserted from Orig_X into Dummy_X with the following settings.
4) Autocommit is set to zero, so that data is not committed after each insert statement, which would hinder performance.
5) Binary logging is turned off, so that nothing is written to the binary log during the load.
6) After the data is inserted, both settings are turned back on.
SET AUTOCOMMIT = 0;
SET sql_log_bin = 0;
INSERT INTO Dummy_X (col1, col2, col3, colNew)
SELECT col1, col2, col3, 0 FROM Orig_X;
COMMIT;
SET sql_log_bin = 1;
SET AUTOCOMMIT = 1;
7) Now the primary key can be created, with the newly added column as part of it.
8) All the unique keys can now be created.
9) We can check the status of the server by issuing the following command
SHOW MASTER STATUS
10) It’s also helpful to issue FLUSH LOGS so MySQL closes and reopens its log files and starts a new binary log.
11) To boost performance when running the same type of queries repeatedly, such as the insert statement above, the query cache variable should be on:
SHOW VARIABLES LIKE 'have_query_cache';
query_cache_type = 1
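As a minimal sketch of steps 2, 7 and 8 (the column names and types below are hypothetical; the real definition must of course mirror Orig_X exactly):
CREATE TABLE Dummy_X (
    col1   INT UNSIGNED NOT NULL,
    col2   VARCHAR(64)  NOT NULL,
    col3   INT          NOT NULL,
    colNew INT          NOT NULL DEFAULT 0   -- step 2: the replica carries the new column
) ENGINE = InnoDB;                           -- deliberately created without keys, so the load stays fast
-- ... bulk load with the INSERT ... SELECT shown above ...
ALTER TABLE Dummy_X ADD PRIMARY KEY (col1, colNew);   -- step 7: colNew becomes part of the primary key
ALTER TABLE Dummy_X ADD UNIQUE KEY uk_col2 (col2);    -- step 8: recreate the unique keys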
Those were the steps of the migration strategy for the large table; below I am writing some steps to improve the performance of the database and its queries.
1) Remove any unnecessary indexes from the table, paying particular attention to UNIQUE indexes, as these disable change buffering. Don't use a UNIQUE index if you have no need for that constraint; prefer a regular INDEX.
2) If you are bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them after all the data is loaded, InnoDB can apply a pre-sort and bulk-load process, which is faster and typically results in more compact indexes.
3) More memory can actually help performance. If SHOW ENGINE INNODB STATUS shows any reads/s under BUFFER POOL AND MEMORY and the number of Free buffers (also under BUFFER POOL AND MEMORY) is zero, you could benefit from more memory (assuming you have sized innodb_buffer_pool_size correctly for your server).
4) Normally your database updates the table's indexes after every insert. That's some heavy lifting for your database, but when your queries are wrapped inside a transaction, the table does not get re-indexed until the entire batch is processed, saving a lot of work (see the transaction sketch after this list).
5) Most MySQL servers have query caching enabled. It's one of the most effective methods of improving performance that is quietly handled by the database engine. When the same query is executed multiple times, the result is fetched from the cache, which is quite fast.
6) Using the EXPLAIN keyword can give you insight into what MySQL is doing to execute your query. This can help you spot bottlenecks and other problems with your query or table structures. The results of an EXPLAIN query show you which indexes are being used, how the table is being scanned and sorted, and so on (see the EXPLAIN sketch after this list).
7) If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
8) In every table, have an id column that is the PRIMARY KEY, AUTO_INCREMENT, and one of the flavors of INT, preferably UNSIGNED, since the value cannot be negative.
9) Even if you have a users table with a unique username field, do not make that your primary key. VARCHAR fields are slower as primary keys, and you will have a better structure in your code by referring to all users by their ids internally.
10) Normally when you perform a query from a script, it will wait for the execution of that query to finish before it can continue. You can change that by using unbuffered queries. This saves a considerable amount of memory with SQL queries that produce large result sets, and you can start working on the result set immediately after the first row has been retrieved as you don't have to wait until the complete SQL query has been performed.
11) With database engines, disk is perhaps the most significant bottleneck. Keeping things smaller and more compact is usually helpful in terms of performance, to reduce the amount of disk transfer.
12) The two main storage engines in MySQL are MyISAM and InnoDB. Each has its own pros and cons. MyISAM is good for read-heavy applications, but it doesn't scale very well when there are a lot of writes: even if you are updating one field of one row, the whole table gets locked, and no other process can even read from it until that query is finished. MyISAM is, however, very fast at calculating
SELECT COUNT(*)
types of queries. InnoDB tends to be a more complicated storage engine and can be slower than MyISAM for most small applications, but it supports row-level locking, which scales better. It also supports more advanced features such as transactions.
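A minimal sketch of tip 4, reusing the hypothetical Dummy_X table from the migration sketch above:
START TRANSACTION;
INSERT INTO Dummy_X (col1, col2, col3, colNew) VALUES (1, 'a', 10, 0);
INSERT INTO Dummy_X (col1, col2, col3, colNew) VALUES (2, 'b', 20, 0);
-- ... thousands more inserts: the work is only made permanent once ...
COMMIT;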
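And a hedged example of tips 6 and 7; the users and orders tables here are made up purely for illustration:
EXPLAIN SELECT u.username, o.total
FROM users u
JOIN orders o ON o.user_id = u.id   -- both orders.user_id and users.id should be indexed
WHERE u.id = 42;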

Related

Does clickhouse support quick retrieval of any column?

I tried to use ClickHouse to store 4 billion rows, deployed on a single machine with a 48-core CPU, 256 GB of memory, and a mechanical hard disk.
My data has ten columns, and I want to quickly search any column through SQL statements, such as:
select * from table where key='mykeyword'; or select * from table where school='Yale';
I use ORDER BY to establish a sorting key: order by (key, school, ...)
But when I search, only the first field of the sorting key (key) performs very well. When searching on the other fields, the query is very slow or even runs out of memory (the memory allocation is already large enough).
So I ask the experts: does ClickHouse support high-performance searches on each column, similar to MySQL indexes? I also tried to create a secondary index for each column, but the performance did not improve.
You should try to understand how sparse primary indexes work
and how exactly the right ORDER BY clause in CREATE TABLE helps your query performance.
ClickHouse will never work the same way as MySQL.
Try to use PRIMARY KEY and ORDER BY in the CREATE TABLE statement,
and put fields with low value cardinality first in the PRIMARY KEY.
Don't select all columns:
SELECT * ...
is really an antipattern.
Moreover, a secondary data-skipping index may help you (but I'm not sure):
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes
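A hedged sketch of that advice; the table and columns below are made up, the point is only to put a low-cardinality column such as school first in the key and to avoid SELECT *:
CREATE TABLE search_data
(
    `key`    String,
    `school` LowCardinality(String),
    `score`  UInt32
)
ENGINE = MergeTree
PRIMARY KEY (school)
ORDER BY (school, `key`);
-- query only the columns you need instead of SELECT *
SELECT `key`, score FROM search_data WHERE school = 'Yale';
-- a data-skipping index on another column may (or may not) help point lookups;
-- existing rows would also need ALTER TABLE search_data MATERIALIZE INDEX idx_key
ALTER TABLE search_data ADD INDEX idx_key `key` TYPE bloom_filter GRANULARITY 4;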

Can a Select Count(*) Affect Writes in Cassandra

I experienced a scenario where a select count(*) on a table every minute (yes, this should definitely be avoided) caused a huge increase in Cassandra writes to around 150K writes per second.
Can anyone explain this weird behavior? Why would a Select query significantly increase write count in Cassandra?
Thanks!
If you check the
org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBackground
and
org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBlocking
metrics, you can see if it's read repairs sending the mutations. Reading all the data to service the count(*) may be causing a lot of read repairs if your data is inconsistent. If that's the case, lowering read_repair_chance and dclocal_read_repair_chance on the table (via ALTER TABLE, sketched below) could reduce the load.
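For example (the keyspace and table names are placeholders, and this only applies to Cassandra versions that still expose these table properties):
ALTER TABLE my_keyspace.my_table
  WITH read_repair_chance = 0.0
  AND dclocal_read_repair_chance = 0.0;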
Other likely possibilities are:
You have tracing enabled (either globally or on the table) at some percentage.
Or, if you use DSE, you have slow query logging enabled.
A possible explanation could be found in the write path of an update:
During a write , Cassandra adds each new row to the database without checking on whether a duplicate record exists. This policy makes it possible that many versions of the same row may exist in the database.
Then
Most Cassandra installations store replicas of each row on two or more nodes. Each node performs compaction independently. This means that even though out-of-date versions of a row have been dropped from one node, they may still exist on another node.
And finally:
This is why Cassandra performs another round of comparisons during a read process. When a client requests data with a particular primary key, Cassandra retrieves many versions of the row from one or more replicas.

Is it possible and meaningful to execute an Array DML command containing BLOB data?

Is it possible to execute an Array DML INSERT or UPDATE statement passing BLOB field data in the parameter array? And the more important part of my question: if it is possible, will an Array DML command containing BLOB data still be more efficient than executing the commands one by one?
I have noticed that TADParam has an AsBlobs indexed property, so I assume it might be possible, but I haven't tried it yet because there is no mention of performance nor an example showing this, and because the indexed property is of type RawByteString, which is not very suitable for my needs.
I'm using FireDAC and working with an SQLite database (Params.BindMode = pbByNumber, so I'm using the native SQLite INSERT with multiple VALUES). My aim is to store about 100k records containing pretty small BLOB data (about 1 kB per record) as fast as possible (at the cost of FireDAC's abstraction).
The main point in your case is that you are using an SQLite3 database.
With SQLite3, Array DML is "emulated" by FireDAC. Since it is a local instance, not a client-server one, there is no need to prepare a bunch of rows and then send them at once to avoid network latency (as with Oracle or MS SQL).
Using Array DML may speed up your insertion process a little bit with SQLite3, but I doubt the gain will be very high. A good plain INSERT with binding by number will work just fine.
The main tips about performance in your case are (see the sketch below):
Nest your process within a single transaction (or even better, use one transaction per 1000 rows of data);
Prepare an INSERT statement, then re-execute it with newly bound parameters each time;
By default, FireDAC initializes SQLite3 with the fastest options (e.g. disabling LOCK), so leave that as it is.
SQLite3 is very good at BLOB processing.
From my tests, FireDAC insertion timing is pretty good, very close to direct SQLite3 access. Only reading is slower than with a direct SQLite3 link, due to the overhead of the Delphi TDataSet class.
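On the SQLite side, those tips boil down to something like the sketch below; the blobs table and its parameters are hypothetical, and the prepare/bind/execute loop itself lives in your Delphi code:
BEGIN TRANSACTION;
-- prepared once, then re-executed with freshly bound :id/:payload values for each row in the batch
INSERT INTO blobs (id, payload) VALUES (:id, :payload);
COMMIT;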

Reindexing a large SQL Server database to Lucene

We have a web service method which accepts some data and puts it in Lucene index. We use it to index new and updated entries from our asp.net web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to be able to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way is to retrieve chunks of data from a large table. Currently, we rely on the fact that the table has an auto-increment PK, so we fetch chunks of 1000 rows until the query starts returning nothing. Kind of like (in pseudo language):
i = 0
while (true)
{
    SELECT col1, col2, col3 FROM mytable WHERE pk >= i AND pk < i + 1000
    .... if result is empty 20 times in a row, break ....
    .... otherwise send result to web service to reindex ....
    i = i + 1000
}
This way, we don't need to do a SELECT COUNT(*), which would be a big performance killer, and we just move up the pk values until we stop getting any results. This has its con: if we have a hole greater than 20,000 values somewhere in the table, it will stop indexing, assuming it has reached the end, but that's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns are:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires a separate db index on the relevant date field(s), which could be a bit costly for 20M+ records, but we decided to go for it (our biggest deployment had up to 10M records), as disk space is cheap these days anyway.
EDIT: added a few explanations as per the question author's comment.
If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically, it forms the correct Lucene Document from that moment on). Then we can reindex things in batches (manually or otherwise) by providing the relevant period ranges. To a certain extent, this also applies to Lucene version changes.
Why is a COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out - I can use IDENT_CURRENT(table_name) to get the last generated id and use that instead of MAX() or COUNT() - this method should blow the other two away :)
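A hedged T-SQL sketch of that idea, reusing the mytable/pk names from the pseudocode above:
-- upper bound for the loop: last identity value generated for the table
SELECT IDENT_CURRENT('mytable') AS last_id;
-- alternative taken straight from the primary key index
SELECT MAX(pk) AS last_id FROM mytable;
-- the chunked fetch can then stop once @i passes last_id instead of counting empty batches
DECLARE @i BIGINT = 0;
SELECT col1, col2, col3 FROM mytable WHERE pk >= @i AND pk < @i + 1000;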

sqlite3 bulk insert from C?

I came across the .import command to do this (bulk insert), but is there a query version of it which I can execute using sqlite3_exec()?
I would just like to copy a small text file contents into a table.
A query version of this one below,
".import demotab.txt mytable"
SQLite's performance doesn't benefit from bulk insert. Simply performing the inserts separately (but within a single transaction!) provides very good performance.
You might benefit from increasing SQLite's page cache size; that depends on the number of indexes and/or the order in which the data is inserted. If you don't have any indexes, for a pure insert, the cache size is likely not to matter much.
Be sure to use a prepared query, as opposed to regenerating a query plan in the innermost loop. It's extremely important to wrap the statements in a transaction, since this avoids the need for the filesystem to sync the database to disk after every row - after all, a partially written transaction is atomically aborted anyhow, meaning that all fsync()'s are delayed until the transaction completes.
Finally, indexes will limit your insert performance, since their creation is somewhat expensive. If you're really dealing with a lot of data and you start off with an empty table, it may be beneficial to add the indexes after the data - though this isn't a huge factor.
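A minimal sketch of the cache and transaction advice above, as plain SQL that could be passed to sqlite3_exec (the cache value is only an example):
PRAGMA cache_size = -65536;   -- roughly 64 MB of page cache; negative values are in KiB
BEGIN TRANSACTION;
-- the individual INSERTs go here, ideally via a statement prepared once and re-bound per row
COMMIT;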
Oh, and you might want to get one of those intel X25-E SSD's and ensure you have an AHCI controller ;-).
I'm maintaining an app with sqlite db's with about 500000000 rows (spread over several tables) - much of which was bulk inserted using plain old begin-insert-commit: it works fine.
