I have tried to dump a database db1 of about 40 GB to an SQL file using mysqldump on system A, where the default storage engine is InnoDB, and then restored it on another system B. Both systems use InnoDB as the default storage engine and run the same MySQL version. I checked for table corruption on system A with CHECK TABLE and found none. I used the query below to calculate the table size and number of rows per table for db1 on both system A and system B, and the results suggest a loss of about 6 GB of data in db1 on system B.
SELECT table_schema,
       SUM(data_length + index_length)/1024/1024 AS total_mb,
       SUM(data_length)/1024/1024 AS data_mb,
       SUM(index_length)/1024/1024 AS index_mb,
       COUNT(*) AS tables,
       CURDATE() AS today
FROM   information_schema.tables
GROUP BY table_schema
ORDER BY 2 DESC;
Can we rely on information_schema to calculate the exact number of rows and the exact table size (data_length + index_length) when InnoDB is the default storage engine? And why would a dump taken with mysqldump result in significant data loss when restored on system B?
InnoDB does not keep an exact row count for its tables, so the row counts and sizes reported by SHOW TABLE STATUS and information_schema.tables are only estimates. If you request such a count repeatedly, you will notice that it fluctuates.
For more information, see the MySQL reference manual page on InnoDB restrictions:
http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html
Restrictions on InnoDB Tables
ANALYZE TABLE determines index cardinality (as displayed in the Cardinality column of SHOW INDEX output) by doing eight random dives to each of the index trees and updating index cardinality estimates accordingly. Because these are only estimates, repeated runs of ANALYZE TABLE may produce different numbers. This makes ANALYZE TABLE fast on InnoDB tables but not 100% accurate because it does not take all rows into account.
MySQL uses index cardinality estimates only in join optimization. If some join is not optimized in the right way, you can try using ANALYZE TABLE. In the few cases that ANALYZE TABLE does not produce values good enough for your particular tables, you can use FORCE INDEX with your queries to force the use of a particular index, or set the max_seeks_for_key system variable to ensure that MySQL prefers index lookups over table scans. See Section 5.1.4, “Server System Variables”, and Section C.5.6, “Optimizer-Related Issues”.
SHOW TABLE STATUS does not give accurate statistics on InnoDB tables, except for the physical size reserved by the table. The row count is only a rough estimate used in SQL optimization.
InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. If your table does not change often, using the MySQL query cache is a good solution. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. If an approximate row count is sufficient, SHOW TABLE STATUS can be used. See Section 14.2.12.1, “InnoDB Performance Tuning Tips”.
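To see the effect described above, the estimate can be compared against an exact count directly; the schema and table names below are placeholders, not taken from the question:
-- Estimated figures (for InnoDB these are rough estimates that change between runs)
SELECT table_rows,
       (data_length + index_length)/1024/1024 AS total_mb
FROM   information_schema.tables
WHERE  table_schema = 'db1' AND table_name = 'some_table';
-- Exact figure (scans an index, so it can be slow on a large table)
SELECT COUNT(*) FROM db1.some_table;
If the exact counts match between system A and system B, the "6 GB difference" most likely comes from those estimates, plus the fact that freshly restored InnoDB tables tend to be packed more compactly than the originals.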
The best way to check whether you actually have any data loss is to compare the contents of the two databases:
mysqldump --skip-comments --skip-extended-insert -u root -p dbName1 > file1.sql
mysqldump --skip-comments --skip-extended-insert -u root -p dbName2 > file2.sql
diff file1.sql file2.sql
See this topic for more information.
Another advantage of this solution is that you can see where you have the differences.
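One hedged addition to the above: if the two servers might emit rows in a different order, sorting each table dump by its primary key keeps the diff limited to real differences. mysqldump's --order-by-primary option does this (it adds an ORDER BY per table, so the dump is slower):
mysqldump --skip-comments --skip-extended-insert --order-by-primary -u root -p dbName1 > file1.sql
mysqldump --skip-comments --skip-extended-insert --order-by-primary -u root -p dbName2 > file2.sql
diff file1.sql file2.sql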
Related
Something went wrong during a structure synchronization between two databases.
One of our production databases is now missing a key table, 'customers' (which just about every other table has foreign keys to).
I'm trying to recreate the table from last night's backup (I don't want to restore the entire db, just recreate this table, as its data does not change much and I don't want to lose today's transactional data).
The hassle seems to be that all the foreign key data for this table still exists in INFORMATION_SCHEMA.KEY_COLUMN_USAGE, and I am getting errno 121 and 150 errors when I try to run the CREATE TABLE query.
I've manually deleted all FKs to the missing table and I am still getting errno 150 when trying to recreate the table. Any ideas where else there might be lost references to this table that are stopping me from creating it again?
This was eventually resolved by repeatedly consulting the output of SHOW ENGINE INNODB STATUS.
The missing table had various indexes; for example, on the customer name there was an index "customer_name_idx". The CREATE TABLE query asked for this index to be created, and SHOW ENGINE INNODB STATUS reported "could not create table because index customer_name_idx already exists".
There was no reference to this index, to any primary key, or to the table itself in any of the metadata tables. I checked:
INFORMATION_SCHEMA.INNODB_SYS_INDEXES
INFORMATION_SCHEMA.TABLES
INFORMATION_SCHEMA.STATISTICS
so I could not explain why this error was being thrown.
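For reference, the checks described above look roughly like this (the INNODB_SYS_* views exist only on newer MySQL versions, and the names come from my example):
-- The "LATEST FOREIGN KEY ERROR" section often names the offending constraint or index
SHOW ENGINE INNODB STATUS;
-- Look for an orphaned index or table entry left behind in the InnoDB data dictionary
SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE NAME = 'customer_name_idx';
SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE '%customers%';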
My guess, after the fact, is that MySQL is holding a cached copy of the information_schema meta data in memory and was consulting that, and maybe that only gets refreshed if you restart MySQL?
The solution was to give the indexes new names as a short-term fix, and to rename them back during our next scheduled downtime.
Once these changes were made, the table could be created and the backup data could be reinstated.
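As a sketch of that workaround (the column definitions are made up; only the index naming is the point), the table was recreated with a new index name, and on MySQL 5.7 or later the index could be renamed back afterwards without a rebuild:
CREATE TABLE customers (
  id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_name VARCHAR(255) NOT NULL,
  INDEX customer_name_idx2 (customer_name)  -- new name to dodge the phantom index
) ENGINE=InnoDB;

-- Later, during scheduled downtime (MySQL 5.7+):
ALTER TABLE customers RENAME INDEX customer_name_idx2 TO customer_name_idx;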
Is there a way to clone a table in Kusto exactly, so that it has all the extents of the original table? Even if extents cannot be retained, is there at least a performant way to copy a table to a new table? I tried the following:
.set new_table <| existing_table;
It ran forever and hit a timeout error. Is there a way to copy such that the Kusto engine recognizes this is just a plain copy, so that instead of going through the engine it simply copies the blobs in the back end and points the new table at the copied blobs, bypassing the whole Kusto processing route?
1. Copying the schema and data of one table to another is possible using the command you mentioned (another option to copy the data is to export its content to cloud storage, then ingest the resulting storage artifacts using Kusto's ingestion API or a tool that uses it, e.g. LightIngest or ADF).
Of course, if the source table has a lot of data, then you would want to split this command into multiple ones, each dealing with a subset of the source data (which you can 'partition', for example, by time).
Below is just one example (it obviously depends on how much data you have in the source table):
.set-or-append [async] new_table <| existing_table | where ingestion_time() > X and ingestion_time() < X + 1h
.set-or-append [async] new_table <| existing_table | where ingestion_time() >= X+1h and ingestion_time() < X + 2h
...
Note that async is optional and is there to avoid a potential client-side timeout (by default after 10 minutes). The command itself continues to run on the backend for up to a non-configurable timeout of 60 minutes (though it's strongly advised to avoid such long-running commands, e.g. by performing the "partitioning" mentioned above).
2. To your other question: There's no option to copy data between tables without re-ingesting the data (an extent / data shard currently can't belong to more than 1 table).
3. If you need to "duplicate" data being ingested into table T1 continuously into table T2, and both T1 and T2 are in the same database, you can achieve that using an update policy.
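A minimal sketch of such an update policy (T1, T2, and the function name are placeholders, and it assumes T2 should receive the rows unchanged):
// A pass-through function that shapes rows from T1 into T2's schema
.create-or-alter function CopyT1() { T1 }
// Attach the update policy to T2: whenever data is ingested into T1, the query runs
// over the newly ingested data and its output is ingested into T2
.alter table T2 policy update @'[{"IsEnabled": true, "Source": "T1", "Query": "CopyT1()", "IsTransactional": false}]'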
I need to run a query which joins 5 large tables on user_id and filters on proc_date.
I have planned to partition on proc_date and on user_id (5 range partitions) to improve query performance. I also keep the primary index on proc_date and user_id.
"But how can I run the query for just one partition of user_id at a time? I want to restrict the query to join only the first partition (on user_id) of every table."
The reason behind this is that once I complete the query for the first partition, I can send the output data to the next process. While that process is running, I can run the query for the second partition.
Could anyone please suggest a solution to achieve this?
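For illustration, the effect I am after is roughly this, run once per user_id range (table names and boundary values below are made up):
SELECT t1.user_id,
       t1.proc_date,
       t2.some_col
FROM   table1 t1
JOIN   table2 t2
  ON   t2.user_id = t1.user_id
WHERE  t1.proc_date = DATE '2020-01-01'
  AND  t1.user_id BETWEEN 1 AND 20000000   -- first user_id partition only
  AND  t2.user_id BETWEEN 1 AND 20000000;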
I have 2 tables, Table A and Table B. Both tables are about 500 GB in size; some of their columns are listed below.
Table A
ID
Type
DateModified
I added a new column, CID, to Table A; this column is available in Table B.
Table B
ID
CID
DateGenerated
Table A is partitioned on DateModified; Table B is not partitioned. My task is to get the CID from Table B and update it in Table A. Both tables have billions of records.
I have tried MERGE and plain SQL updates, but they are too slow and cannot complete within 2 days.
Adding a new column to an existing table causes row fragmentation. Updating the new column to some value will probably cause massive row chaining, partitioned or not. And yes, that is slow, even when there are sufficient indexes.
Recommended approach:
You are on Enterprise Edition since you have partitioning, so you might be able to solve this using the schema versions functionality.
But if this is a one-time action and you do not know that feature well, I would use a "create table ... as" approach: build the table from scratch and then switch it in when ready. Take care not to miss any trickle-loaded transactions. With partitioning it will be fast: writing 500 GB at, say, 50 MB/s on a strong server is not unrealistic, which works out to roughly 3 hours.
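A rough sketch of that approach (names are placeholders, and I am assuming Oracle syntax since the answer mentions row chaining and Enterprise Edition partitioning; the real DDL would also repeat Table A's PARTITION BY clause):
CREATE TABLE table_a_new NOLOGGING PARALLEL 8 AS
SELECT a.id,
       a.type,
       a.datemodified,
       b.cid               -- the new column, sourced from Table B
FROM   table_a a
LEFT JOIN table_b b
  ON   b.id = a.id;

-- Catch up any rows loaded while the build ran, then swap the tables:
-- ALTER TABLE table_a     RENAME TO table_a_old;
-- ALTER TABLE table_a_new RENAME TO table_a;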
I am new to Teradata. Can anyone tell me how exactly the AMPs are involved in the creation of a table in Teradata?
Let's take a scenario.
I have a Teradata database with 4 AMPs. I learned that the AMPs are used when we insert data into a table: depending on the index, the rows are distributed with the help of the respective AMPs. But the CREATE TABLE command itself also has to execute through the AMPs, so I want to know which AMP is used at that time.
The actual creation of the table in the data dictionary is a RowHash-level operation involving a single AMP to store the record in DBC.TVM. Based on the other actions listed in the EXPLAIN, there may be other AMPs involved as well, but there is no single all-AMP operation. (This does not take into consideration the loading of the data and its distribution across the AMPs.)
Sample EXPLAIN:
1) First, we lock FUBAR.ABC for exclusive use.
2) Next, we lock a distinct DBC."pseudo table" for write on a RowHash
for deadlock prevention, we lock a distinct DBC."pseudo table" for
write on a RowHash for deadlock prevention, we lock a distinct
DBC."pseudo table" for read on a RowHash for deadlock prevention,
and we lock a distinct DBC."pseudo table" for write on a RowHash
for deadlock prevention.
3) We lock DBC.DBase for read on a RowHash, we lock DBC.Indexes for
write on a RowHash, we lock DBC.TVFields for write on a RowHash,
we lock DBC.TVM for write on a RowHash, and we lock
DBC.AccessRights for write on a RowHash.
4) We execute the following steps in parallel.
1) We do a single-AMP ABORT test from DBC.DBase by way of the
unique primary index.
2) We do a single-AMP ABORT test from DBC.TVM by way of the
unique primary index.
3) We do an INSERT into DBC.Indexes (no lock required).
4) We do an INSERT into DBC.TVFields (no lock required).
5) We do an INSERT into DBC.TVM (no lock required).
6) We INSERT default rights to DBC.AccessRights for FUBAR.ABC.
5) We create the table header.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
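For reference, the plan above is produced simply by prefixing the DDL with EXPLAIN; the column list below is made up, only the form of the request matters:
EXPLAIN
CREATE TABLE FUBAR.ABC
( col1 INTEGER
, col2 VARCHAR(100)
)
PRIMARY INDEX (col1);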