InnoDB organizes its data in B+ trees. The height of the tree affects the number of I/O operations per lookup, which can be one of the main reasons a database slows down.
So my question is how to predict or calculate the height of the B+ tree (e.g. based on the number of pages, which can be derived from row size, page size, and row count), and thus decide whether or not to partition the data across different masters.
https://www.percona.com/blog/2009/04/28/the_depth_of_a_b_tree/
Let N be the number of rows in the table.
Let B be the number of keys that fit in one B-tree node.
The depth of the tree is (log N) / (log B).
From the blog:
Let’s put some numbers in there. Say you have a billion rows, and you can currently fit 64 keys in a node. Then the depth of the tree is (log 10^9) / (log 64) ≈ 30/6 = 5. Now you rebuild the tree with keys half the size and you get (log 10^9) / (log 128) ≈ 30/7 = 4.3. Assuming the top 3 levels of the tree are in memory, then you go from 2 disk seeks on average to 1.3 disk seeks on average, for a 35% speedup.
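A quick back-of-the-envelope check of those numbers (a minimal sketch; the row count and the fanouts 64 and 128 are just the blog's example values):

# Recompute the blog's example depths: depth ~ log(N) / log(B)
import math

N = 10**9                      # a billion rows
for fanout in (64, 128):
    depth = math.log(N) / math.log(fanout)
    print(f"fanout {fanout}: depth ~ {depth:.1f}")
# fanout 64  -> ~5.0 levels
# fanout 128 -> ~4.3 levels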
I would also add that usually you don't have to optimize for I/O cost, because the data you use frequently should be in the InnoDB buffer pool, therefore it won't incur any I/O cost to read it. You should size your buffer pool sufficiently to make this true for most reads.
Simpler computation
The quick and dirty answer is log base 100, rounded up. That is, each node in the BTree has about 100 leaf nodes. In some circles, this is called fanout.
1K rows: 2 levels
1M rows: 3 levels
billion: 5 levels
trillion: 6 levels
These numbers work for "average" rows or indexes. Of course, you could have extremes of about 2 or 1000 for the fanout.
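Checking the rule of thumb against that table (a minimal sketch, using "log base 100, rounded up" exactly as described above):

# "log base 100, rounded up" for the row counts listed above
import math

for rows in (10**3, 10**6, 10**9, 10**12):
    levels = math.log(rows) / math.log(100)            # log base 100
    # round() guards against floating-point jitter before rounding up
    print(f"{rows:>16,} rows: {math.ceil(round(levels, 9))} levels")
# 1K -> 2, 1M -> 3, billion -> 5, trillion -> 6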
Exact depth
You can find the actual depth from some information_schema:
For Oracle's MySQL:
$where = "WHERE ( ( database_name = ? AND table_name = ? )
OR ( database_name = LOWER(?) AND table_name = LOWER(?) ) )";
$sql = "SELECT last_update,
n_rows,
'Data & PK' AS 'Type',
clustered_index_size * 16384 AS Bytes,
ROUND(clustered_index_size * 16384 / n_rows) AS 'Bytes/row',
clustered_index_size AS Pages,
ROUND(n_rows / clustered_index_size) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
UNION
SELECT last_update,
n_rows,
'Secondary Indexes' AS 'BTrees',
sum_of_other_index_sizes * 16384 AS Bytes,
ROUND(sum_of_other_index_sizes * 16384 / n_rows) AS 'Bytes/row',
sum_of_other_index_sizes AS Pages,
ROUND(n_rows / sum_of_other_index_sizes) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
AND sum_of_other_index_sizes > 0
";
For Percona:
/* to enable stats:
percona < 5.5: set global userstat_running = 1;
5.5: set global userstat = 1; */
$sql = "SELECT i.INDEX_NAME as Index_Name,
IF(ROWS_READ IS NULL, 'Unused',
IF(ROWS_READ > 2e9, 'Overflow', ROWS_READ)) as Rows_Read
FROM (
SELECT DISTINCT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
FROM information_schema.STATISTICS
) i
LEFT JOIN information_schema.INDEX_STATISTICS s
ON i.TABLE_SCHEMA = s.TABLE_SCHEMA
AND i.TABLE_NAME = s.TABLE_NAME
AND i.INDEX_NAME = s.INDEX_NAME
WHERE i.TABLE_SCHEMA = ?
AND i.TABLE_NAME = ?
ORDER BY IF(i.INDEX_NAME = 'PRIMARY', 0, 1), i.INDEX_NAME";
(Those give more than just the depth.)
PRIMARY refers to the data's BTree. Names like "n_diff_pfx03" refer to the 3rd level of the BTree; the largest such number for a table indicates the total depth.
Row width
As for estimating the width of a row, see Bill's answer. Here's another approach:
Look up the size of each column (INT=4 bytes, use averages for VARs)
Sum those.
Multiply by between 2 and 3 (to allow for overhead of InnoDB)
Divide that into 16K to get the average number of rows per leaf node.
Non-leaf nodes, plus index leaf nodes, are trickier because you need to understand exactly what represents a "row" in such nodes.
(Hence, my simplistic "100 rows per node".)
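A rough sketch of that arithmetic (the column names and sizes below are made-up examples, not taken from any particular table):

# Estimate rows per 16KB leaf page from per-column byte sizes (hypothetical columns)
column_bytes = {"id": 4, "user_id": 4, "created_at": 5, "note_avg": 100}   # INT = 4, DATETIME = 5, avg VARCHAR = 100
raw_row = sum(column_bytes.values())              # 113 bytes of raw data

overhead_factor = 2.5                             # "multiply by between 2 and 3" for InnoDB overhead
row_with_overhead = raw_row * overhead_factor     # ~282 bytes

page_bytes = 16 * 1024                            # InnoDB default page size
rows_per_leaf = int(page_bytes // row_with_overhead)
print(rows_per_leaf)                              # ~57 rows per leaf page -- same ballpark as "100 rows per node"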
But who cares?
Here's another simplification that seems to work quite well. Since disk hits are the biggest performance item in queries, you need to "count the disk hits" as the first order of judging the performance of a query.
But look at the caching of blocks in the buffer_pool. A parent node is 100 times as likely to be recently touched as the child node.
So, the simplification is to "assume" that all non-leaf nodes are cached and all leaf nodes need to be fetched from disk. Hence the depth is not nearly as important as how many leaf node blocks are touched. This shoots down your "35% speedup" -- Sure 35% speedup for CPU, but virtually no speedup for I/O. And I/O is the important component.
Note that if you are fetching the latest 20 rows of a table that is stored chronologically, they will be found in the last 1 (or maybe 2) blocks. If they are stored by a UUID, it is more likely to take 20 blocks -- many more disk hits, hence much slower.
Secondary Indexes
The PRIMARY KEY is clustered with the data. That implies that a lookup by the PK needs to drill down one BTree. But a secondary index is implemented as a second BTree -- drill down it to find the PK, then drill down via the PK. When "counting the disk hits", you need to consider both BTrees, and consider the randomness (eg, for UUIDs) or not (date-ordered).
Writes
Find the block (possibly cached)
Update it
If necessary, deal with a block split
Flag the block as "dirty" in the buffer_pool
Eventually write it back to disk.
Step 1 may involve a read I/O; step 5 may involve a write I/O -- but you are not waiting for it to finish.
Index updates
UNIQUE indexes must be checked before finishing an INSERT. This involves a potentially-cached read I/O.
For a non-unique index, an entry in the "Change buffer" is made. (This lives in the buffer_pool.) Eventually that is merged with the appropriate block on disk. That is, no waiting for I/O when INSERTing a row (at least not waiting to update non-unique indexes).
Corollary: UNIQUE indexes are more costly. But is there really any need for more than 2 such indexes (including the PK)?
Related
I have an image of a USB drive with 3 partitions:
Partition 1: FAT32
Partition 2: exFAT
Partition 3: NTFS
I am making a program that goes through the partitions, but I am unsure how I can know how many partitions my program should look for. By looking at the raw data I can see that it has three partitions as expected, but of course my program doesn't know this.
I tried to look at "80 (0x50) 4 bytes Number of partition entries in array", but in my example it gave me the value 128 (0x80000000).
Here are screenshots of hex from my example image.
Protective MBR
Partition table header (LBA 1)
signature=- HexLe=4546492050415254 HexBe=5452415020494645
revisionHexLe=000001 HexLe=4546492050415254 HexBe=5452415020494645
headerSizeDec=92 HexLe=5C000000 HexBe=0000005C
crc2OfHeaderDec=82845332 HexLe=941EF004 HexBe=04F01E94
reservedADec=0 HexLe=00000000 HexBe=00000000
currentLBADec=1 HexLe=0100000000000000 HexBe=0000000000000001
backupLBADec=30277631 HexLe=FFFFCD0100000000 HexBe=0000000001CDFFFF
firstUsableLBAForPartitionsDec=34 HexLe=2200000000000000 HexBe=0000000000000022
lastUsableLBADec=30277598 HexLe=DEFFCD0100000000 HexBe=0000000001CDFFDE
diskGUIDHexMe=8B3F71C5AF9D744D9CA3EBFF7D1F9DC9
startingLBAOfArrayOfPartitionEntriesDec=2 HexLe=0200000000000000 HexBe=0000000000000002
numberOfPartitionEntriesInArrayDec=128 HexLe=80000000 HexBe=00000080
sizeOfASinglePartitionEntryDec=128 HexLe=80000000 HexBe=00000080
crc2OfPartitionEntriesArrayDec=-2043475264 HexLe=C00A3386 HexBe=86330AC0
reservedBDec=00000000 HexLe=00000000 HexBe=00000000
We are going to look for partitions now at offset 1024
Partition entries (LBA 2–33)
Bit late, and you may have already figured it out by now.
Refer to the layout figure on the wiki page for the GUID Partition Table; the wiki page itself will provide further information.
It is not possible to determine the number of partitions just by looking at the GUID partition table header at LBA 1; you have to examine the partition entries and check whether the partition type GUID marks each entry as unused (all zeros) or not.
The "number of partition entries" field in the header at offset 80 (0x50) is the total number of entry slots in the array (each entry having the size given at offset 84), not the count of partitions actually in use.
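A minimal sketch of that check, assuming 512-byte sectors and a hypothetical image file name ("usb.img"); it counts entries whose partition type GUID is non-zero:

# Count used GPT partition entries in a raw disk image
import struct

SECTOR = 512

with open("usb.img", "rb") as f:
    f.seek(1 * SECTOR)                                        # GPT header lives at LBA 1
    header = f.read(92)
    assert header[0:8] == b"EFI PART"                         # signature check
    (entries_lba,) = struct.unpack_from("<Q", header, 72)     # offset 72: starting LBA of entry array
    (num_entries,) = struct.unpack_from("<I", header, 80)     # offset 80: number of entry slots (e.g. 128)
    (entry_size,)  = struct.unpack_from("<I", header, 84)     # offset 84: size of one entry (e.g. 128)

    f.seek(entries_lba * SECTOR)
    used = 0
    for _ in range(num_entries):
        entry = f.read(entry_size)
        if entry[0:16] != b"\x00" * 16:                       # all-zero type GUID means "unused"
            used += 1

print("partitions in use:", used)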
I have a unidirectional graph.
The structure is as follows:
There are about 20,000 nodes in the graph.
I make the simplest request: MATCH (b1)-[:NEXT_BAR*10]->(b2) RETURN b1.id, b2.id LIMIT 5
The request is processed quickly.
But if I increase the number of relationships, the query takes much longer to process. In other words, the speed depends on the number of relationships.
This request takes longer than 5 minutes to complete: MATCH (b1)-[:NEXT_BAR*10000]->(b2) RETURN b1.id, b2.id LIMIT 5
This is still a simplified version. The request can have more than two nodes and the number of relationships can still be a range.
How can I optimize a query with a large number of relationships?
Perhaps there are other graph DBMS where there is no such problem?
Variable-length relationship queries have exponential time and memory complexity.
If R is the average number of suitable relationships per node, and D is the depth of the search, then the complexity is O(R ** D). This complexity will exist in any DBMS.
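A quick illustration of how fast that blows up (the R and D values here are arbitrary examples):

# O(R ** D): candidate paths to examine for a few fanouts and depths
for R in (1, 2, 3):
    for D in (10, 100):
        print(f"R={R}, D={D}: {float(R ** D):.3g} candidate paths")
# With R = 1 (a simple chain) the work stays linear in the path length,
# but any branching (R >= 2) explodes long before D = 10000.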
The theory is simple here, but there are a couple of intricacies in the query execution.
-[:NEXT_BAR*10000]-> matches a path that is exactly 10000 edges long, so the query engine spends some time finding these paths. Another thing to mention is that in (b1)-[...]->(b2), b1 and b2 are not specific, which means that the query engine has to scan all nodes. If there is a LIMIT, yes, the scan should stop once a limited number of items is returned. The whole execution also depends on the efficiency of the variable-length path implementation.
Some of the following might help:
Is it feasible to start from a specific node?
If there are branches, the only hope is aggressive filtering because of exponential complexity (as cybersam well explained).
Use a smaller number in the variable-length expansion, or a range, e.g., [:NEXT_BAR*..10000]. In this case, the query engine will match any path up to 10000 in length (different semantics, but maybe applicable).
* implies a DFS type of execution. On the other hand, BFS might be the right approach. Memgraph (DISCLAIMER: I'm the co-founder and CTO) also supports a BFS type of execution with a filtering lambda.
Here is a Python script I've used to generate and import data into Memgraph. By using small nodes_no you can quickly notice the execution patterns.
import mgclient

# Make a connection to the database.
connection = mgclient.connect(
    host='127.0.0.1',
    port=7687,
    sslmode=mgclient.MG_SSLMODE_REQUIRE)
connection.autocommit = True
cursor = connection.cursor()

# Clean and setup database instance.
cursor.execute("""MATCH (n) DETACH DELETE n;""")
cursor.execute("""CREATE INDEX ON :Node(id);""")

# Import dataset.
nodes_no = 10

# Create nodes.
for identifier in range(0, nodes_no):
    cursor.execute("""CREATE (:Node {id: "%s"});""" % identifier)

# Create edges.
for identifier in range(1, nodes_no):
    cursor.execute("""
        MATCH (start_node:Node {id: "%s"})
        MATCH (end_node:Node {id: "%s"})
        CREATE (start_node)-[:NEXT_BAR]->(end_node);
    """ % (identifier - 1, identifier))
I coded a Java implementation of a hash table, and I want to test its complexity. The hash table is structured as an array of doubly linked lists (also implemented by me). The dimension of the array is m. I implemented a division hashing function, a multiplication one, and a universal one. For now I'm testing the division hashing.
I've developed a testing suite set up this way:
U (maximum value for a key) = 10000;
m (number of position in the hashkey) = 709;
n (number of element to be inserted) = variable.
So I ran multiple insertions, gradually inserting arrays with different n. I measured the execution time with System.nanoTime().
The resulting graph is the following:
http://imgur.com/AVpKKZu
Assuming that a single insert is O(1), n inserts are O(n). So should this graph look like O(n)?
If I change my values like this:
U = 1000000
m = 1009
n = variable (I inserted one array per run, increasing the size by 25000 elements each time, from 25000 elements up to 800000 elements).
The graph I got looks a little strange:
http://imgur.com/l8OcQYJ
The unique keys of the elements to be inserted are chosen pseudo-randomly from the key universe U.
But across different executions, even if I store the same keys in a file and reuse them, the behavior of the graph always changes, with some peaks.
Hope you can help me. If anyone needs the code, just comment and I will be happy to share it.
I was trying to find the best work-group size for a problem and I figured out something that I couldn't justify for myself.
These are my results :
GlobalWorkSize {6400 6400 1}, WorkGroupSize {64 4 1}, Time(Milliseconds) = 44.18
GlobalWorkSize {6400 6400 1}, WorkGroupSize {4 64 1}, Time(Milliseconds) = 24.39
Swapping the axes made execution twice as fast. Why!?
By the way, I was using an AMD GPU.
Thanks :-)
EDIT :
This is the kernel (a Simple Matrix Transposition):
__kernel void transpose(__global float *input, __global float *output, const int size) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    output[i*size + j] = input[j*size + i];
}
I agree with @Thomas; it most probably depends on your kernel. Most probably, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.
Coalescence: When threads need to access elements in memory, the hardware tries to access these elements in as few transactions as possible, i.e. if thread 0 and thread 1 have to access contiguous elements, there will be only one transaction.
Full use of a memory transaction: Let's say you have a GPU that fetches 32 bytes in one transaction. If you have 4 threads that each need to fetch one int, you are using only half of the data fetched by the transaction; you waste the rest (assuming an int is 4 bytes).
To illustrate this, let's say that you have an n by n matrix to access. Your matrix is in row-major order, and you use n threads organized in one dimension. You have two possibilities:
Each workitem takes care of one column, looping through each column element one at a time.
Each workitem takes care of one line, looping through each line element one at a time.
It might be counter-intuitive, but the first solution will be able to make coalesced accesses while the second won't. The reason is that when the first workitem needs to access the first element of the first column, the second workitem accesses the first element of the second column, and so on. These elements are contiguous in memory. This is not the case for the second solution.
Now if you take the same example and apply solution 1, but this time you have 4 workitems instead of n and the same GPU I just mentioned, you'll most probably increase the time by a factor of 2, since you will waste half of your memory transactions.
EDIT: Now that you posted your kernel I see that I forgot to mention something else.
With your kernel, it seems that choosing a local size of (1, 256) or (256, 1) is always a bad choice. In the first case, 256 transactions will be necessary to read a column (each fetching 32 bytes, of which only 4 will be used -- keeping in mind the same GPU from my previous examples) from input, while 32 transactions will be necessary to write to output: you can write 8 floats in one transaction, hence 32 transactions to write the 256 elements.
This is the same problem with a workgroup size of (256, 1), but this time using 32 transactions to read and 256 to write.
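To put numbers on that (a minimal sketch using the same hypothetical GPU: 32-byte transactions, 4-byte floats, no caching):

# Transaction count for one (1, 256) work-group of the transpose kernel
TRANSACTION_BYTES = 32
FLOAT_BYTES = 4
work_items = 256

# Reading a column: each of the 256 elements lives in a different 32-byte segment
read_transactions = work_items                                         # 256

# Writing a row: 256 contiguous floats, 8 floats per 32-byte transaction
write_transactions = work_items * FLOAT_BYTES // TRANSACTION_BYTES     # 32

print(read_transactions, write_transactions)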
So why does the first size work better? Because there is a cache system that can mitigate the bad accesses for the read part. The size (1, 256) is good for the write part, and the cache system handles the not-so-good read part, decreasing the number of read transactions actually needed.
Note that the number of transactions decreases overall (taking into consideration all the workgroups within the NDRange). For example, the first workgroup issues 256 transactions to read the first 256 elements of the first column. The second workgroup might just go to the cache to retrieve the elements of the second column, because they were fetched by the (32-byte) transactions issued by the first workgroup.
Now, I'm almost sure that you can do better than (1, 256); try (8, 32).
OK, let's say we have 100 billion items that need to be sorted.
Our memory is big enough for these items.
Can we still use List.sort (Merge sort) to sort them?
My concerns have two parts:
Will the extra space that mergesort needs become a concern in this case?
Since we use immutable data structures, we have to repeatedly create new lists for the 100 billion items during sorting; will this become a disadvantage in terms of performance?
For sorting 100 billion items, should I use an array in this case?
The standard mergesort implementation is clever in not reallocating too much memory (the split-in-half at the beginning allocates no new memory). Given an input list of n conses, it will allocate n * log(n) list conses in the worst case (with an essentially identical best case). Given that the values of the elements themselves will be shared between the input, intermediary and output lists, you will only allocate 3 words per list cons, which means that the sort will allocate 3 * n * log(n) words in memory in total (for n = 100 billion, 3 * log(n) is about 110, which is quite a huge constant factor).
On the other hand, garbage collection can collect some of that memory: the worst-case memory usage is total live memory, not total allocated memory. In fact, the intermediary lists built during the log(n) layers of recursive subcalls can be collected before any result is returned (they become dead at the same rate the final merge allocates new cells), so this algorithm keeps n additional live cons cells in the worst case, which means only 3*n words or 24*n bytes. For n = 100 billion, that means 2.4 additional terabytes of memory, just as much as you needed to store the spine of the input list in the first place.
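A quick sketch of that arithmetic, under the same assumptions as above (3 words per cons cell, 8-byte words on a 64-bit machine):

# Worst-case allocation vs. peak extra live memory for mergesort on an immutable list
import math

n = 100 * 10**9                                    # 100 billion items
bytes_per_cons = 3 * 8                             # 3 words of 8 bytes each

allocated = n * math.log2(n) * bytes_per_cons      # ~n * log(n) conses allocated in total
live = n * bytes_per_cons                          # at most ~n extra live conses at any time

print(f"total allocated: ~{allocated / 1e12:.0f} TB (mostly collectable as you go)")
print(f"peak extra live: ~{live / 1e12:.1f} TB")   # ~2.4 TB, matching the figure above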
Finally, if you keep no reference to the input list itself, you can collect the first half of it as soon as it is sorted, giving you an n/2 worst-case bound instead of n. And you can collect the first half of this first half while you sort the first half, giving you an n/4 worst-case bound instead of n/2. Going to the limit with this reasoning, I believe that with enough GC work, you can in fact sort the list entirely in place -- modulo some constant-size memory pool for a stop&copy first-generation GC, whose size will impact the time performance of the algorithm.