Teradata space up issues

I am having a difficult time understanding a procedure we currently use for increasing and decreasing space (space up and space down) against a reserve database. Below is part of the space-down job:
LOCKING DBC.DiskSpace FOR ACCESS
SELECT
(( (T1.MAXPERM) - (T1.MAXCURRENTPERM * T1.NUMAMPS))
- (T2.SPACE_LEFT + T1.NUMAMPS)) (FORMAT 'Z(15)9')
INTO :var_SpaceAdj
FROM Ctrl_Base.Space_Ctrl T2
,(
SELECT DATABASENAME, SUM(MAXPERM), MAX(CURRENTPERM), COUNT(*)
FROM DBC.DiskSpace
WHERE DATABASENAME = :inP_Database
GROUP BY 1)
AS T1 (DATABASENAME, MAXPERM, MAXCURRENTPERM, NUMAMPS)
WHERE T1.DATABASENAME = T2.DATABASENAME
AND (( (T1.MAXPERM) - (T1.MAXCURRENTPERM * T1.NUMAMPS))
- (T2.SPACE_LEFT + T1.NUMAMPS)) >= 1000000
;
Can anybody please help me understand what this is doing?
We have a Ctrl_Base.Space_Ctrl table where we specify the % increase for space up and a Space_Left entry.
Regards,
Amit

((T1.MAXPERM) - (T1.MAXCURRENTPERM * T1.NUMAMPS))
T1.MAXPERM - This is the total allocated space in the database
T1.MAXCURRENTPERM - This is the used space on the AMP consuming the highest amount of space. Since data is distributed by the PI of the table, uneven distribution can lead to a database reporting no free space when the AMP consuming the most space is unable to store any more data.
(T1.MAXCURRENTPERM * T1.NUMAMPS) - Estimates the consumed space for the database based on the AMP consuming the most space, thereby accounting for uneven data distribution in the database.
The derived table T1 should be straightforward. It simply aggregates the space information from the AMP/DatabaseName level in DBC.DiskSpace up to the database level.
The second half of the WHERE clause places a condition that the difference between that effective free space for the database in T1 and the Space_Left column of the control table is at least 1,000,000 bytes.
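To make the arithmetic concrete, here is a rough walk-through of that expression in Python; every value below is made up purely for illustration and is not taken from the post:

# Hypothetical values, purely for illustration -- not taken from the post.
maxperm        = 100_000_000_000  # T1.MAXPERM: SUM(MaxPerm), total space allocated to the database
maxcurrentperm = 400_000_000      # T1.MAXCURRENTPERM: MAX(CurrentPerm), usage on the fullest AMP
numamps        = 200              # T1.NUMAMPS: COUNT(*) of AMP rows for the database
space_left     = 10_000_000_000   # T2.SPACE_LEFT from Ctrl_Base.Space_Ctrl

# Treat every AMP as if it were as full as the fullest one (skew-adjusted usage).
effective_used = maxcurrentperm * numamps   # 80,000,000,000
effective_free = maxperm - effective_used   # 20,000,000,000

# Space that can be handed back to the reserve database while keeping
# SPACE_LEFT (plus NUMAMPS bytes of slack) behind.
space_adj = effective_free - (space_left + numamps)
print(space_adj)  # 9999999800 -- the job only acts when this is >= 1,000,000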
Hope this helps.

Related

How to predict the IO count of a MySQL query?

InnoDB organizes its data in B+ trees. The height of the tree affects the number of I/O operations, which may be one of the main reasons the DB slows down.
So my question is how to predict or calculate the height of the B+ tree (e.g. based on the page count, which can be calculated from row size, page size, and row count), and thus to make a decision about whether or not to partition the data across different masters.
https://www.percona.com/blog/2009/04/28/the_depth_of_a_b_tree/
Let N be the number of rows in the table.
Let B be the number of keys that fit in one B-tree node.
The depth of the tree is (log N) / (log B).
From the blog:
Let's put some numbers in there. Say you have a billion rows, and you can currently fit 64 keys in a node. Then the depth of the tree is (log 10^9) / (log 64) ≈ 30/6 = 5. Now you rebuild the tree with keys half the size and you get (log 10^9) / (log 128) ≈ 30/7 = 4.3. Assuming the top 3 levels of the tree are in memory, then you go from 2 disk seeks on average to 1.3 disk seeks on average, for a 35% speedup.
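A quick sketch of the same arithmetic (the numbers mirror the blog's example; the helper name is just for the sketch, and the base of the logarithm cancels out):

import math

def btree_depth(rows: int, keys_per_node: int) -> float:
    # Approximate B-tree depth: log(N) / log(B).
    return math.log(rows) / math.log(keys_per_node)

print(btree_depth(10**9, 64))   # ~4.98 -> about 5 levels
print(btree_depth(10**9, 128))  # ~4.27 -> about 4.3 levels, as in the blog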
I would also add that usually you don't have to optimize for I/O cost, because the data you use frequently should be in the InnoDB buffer pool, therefore it won't incur any I/O cost to read it. You should size your buffer pool sufficiently to make this true for most reads.
Simpler computation
The quick and dirty answer is log base 100, rounded up. That is, each node in the BTree has about 100 child nodes. In some circles, this is called the fanout.
1K rows: 2 levels
1M rows: 3 levels
billion: 5 levels
trillion: 6 levels
These numbers work for "average" rows or indexes. Of course, you could have extremes of about 2 or 1000 for the fanout.
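The same rule of thumb in code, with the fanout of 100 as the stated assumption:

import math

def levels(rows: int, fanout: int = 100) -> int:
    # Rule-of-thumb BTree depth: log base `fanout`, rounded up.
    return math.ceil(math.log(rows, fanout))

for n in (1_000, 1_000_000, 1_000_000_000, 1_000_000_000_000):
    print(n, levels(n))  # 2, 3, 5, 6 levels -- matching the list above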
Exact depth
You can find the actual depth from some of the stats tables:
For Oracle's MySQL:
$where = "WHERE ( ( database_name = ? AND table_name = ? )
OR ( database_name = LOWER(?) AND table_name = LOWER(?) ) )";
$sql = "SELECT last_update,
n_rows,
'Data & PK' AS 'Type',
clustered_index_size * 16384 AS Bytes,
ROUND(clustered_index_size * 16384 / n_rows) AS 'Bytes/row',
clustered_index_size AS Pages,
ROUND(n_rows / clustered_index_size) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
UNION
SELECT last_update,
n_rows,
'Secondary Indexes' AS 'BTrees',
sum_of_other_index_sizes * 16384 AS Bytes,
ROUND(sum_of_other_index_sizes * 16384 / n_rows) AS 'Bytes/row',
sum_of_other_index_sizes AS Pages,
ROUND(n_rows / sum_of_other_index_sizes) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
AND sum_of_other_index_sizes > 0
";
For Percona:
/* to enable stats:
percona < 5.5: set global userstat_running = 1;
5.5: set global userstat = 1; */
$sql = "SELECT i.INDEX_NAME as Index_Name,
IF(ROWS_READ IS NULL, 'Unused',
IF(ROWS_READ > 2e9, 'Overflow', ROWS_READ)) as Rows_Read
FROM (
SELECT DISTINCT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
FROM information_schema.STATISTICS
) i
LEFT JOIN information_schema.INDEX_STATISTICS s
ON i.TABLE_SCHEMA = s.TABLE_SCHEMA
AND i.TABLE_NAME = s.TABLE_NAME
AND i.INDEX_NAME = s.INDEX_NAME
WHERE i.TABLE_SCHEMA = ?
AND i.TABLE_NAME = ?
ORDER BY IF(i.INDEX_NAME = 'PRIMARY', 0, 1), i.INDEX_NAME";
(Those give more than just the depth.)
PRIMARY refers to the data's BTree. Names like "n_diff_pfx03" refer to the 3rd level of the BTree; the largest such number for a table indicates the total depth.
Row width
As for estimating the width of a row, see Bill's answer. Here's another approach:
Look up the size of each column (INT=4 bytes, use averages for VARs)
Sum those.
Multiply by between 2 and 3 (to allow for overhead of InnoDB)
Divide that into 16K to get the average number of rows per leaf node.
Non-leaf nodes, plus index leaf nodes, are trickier because you need to understand exactly what represents a "row" in such nodes.
(Hence, my simplistic "100 rows per node".)
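As a sketch, here is that estimate for a hypothetical row layout; the column sizes and the 2-3x overhead factor are the assumptions described above, not measured values:

# Hypothetical row: INT id, INT qty, DATETIME ts, VARCHAR name averaging ~40 bytes.
column_bytes = [4, 4, 5, 40]      # rough per-column sizes
raw_row = sum(column_bytes)       # 53 bytes
overhead = 2.5                    # InnoDB overhead factor, somewhere between 2 and 3
est_row = raw_row * overhead      # ~132 bytes

page_size = 16 * 1024             # InnoDB default page size
rows_per_leaf = page_size // int(est_row)
print(rows_per_leaf)              # ~124 rows per leaf page for this made-up row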
But who cares?
Here's another simplification that seems to work quite well. Since disk hits are the biggest performance item in queries, you need to "count the disk hits" as the first-order way of judging the performance of a query.
But look at the caching of blocks in the buffer_pool. A parent node is 100 times as likely to have been touched recently as a child node.
So, the simplification is to "assume" that all non-leaf nodes are cached and all leaf nodes need to be fetched from disk. Hence the depth is not nearly as important as how many leaf node blocks are touched. This shoots down your "35% speedup" -- sure, a 35% speedup for CPU, but virtually no speedup for I/O. And I/O is the important component.
Note that if you are fetching the latest 20 rows of a table that is stored chronologically, they will be found in the last 1 (or maybe 2) blocks. If the rows are stored by a UUID, it is more likely to take 20 blocks -- many more disk hits, hence much slower.
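Here is a sketch of that "count the leaf blocks" heuristic, assuming all non-leaf pages are cached; rows_per_block reuses the made-up figure from the row-width sketch above:

import math

def leaf_blocks_touched(rows_fetched: int, rows_per_block: int, clustered: bool) -> int:
    # Estimated leaf-page reads, assuming every non-leaf page is already cached.
    if clustered:
        # Rows stored adjacently (e.g. chronological insert order): few blocks.
        return math.ceil(rows_fetched / rows_per_block)
    # Rows scattered (e.g. UUID keys): roughly one block per row.
    return rows_fetched

print(leaf_blocks_touched(20, 124, clustered=True))   # 1 block  -> ~1 disk hit
print(leaf_blocks_touched(20, 124, clustered=False))  # 20 blocks -> ~20 disk hits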
Secondary Indexes
The PRIMARY KEY is clustered with the data. That implies that a lookup by the PK needs to drill down one BTree. But a secondary index is implemented as a second BTree -- drill down it to find the PK, then drill down via the PK. When "counting the disk hits", you need to consider both BTrees. And consider the randomness (eg, for UUIDs) or not (date-ordered).
Writes
Find the block (possibly cached)
Update it
If necessary, deal with a block split
Flag the block as "dirty" in the buffer_pool
Eventually write it back to disk.
Step 1 may involve a read I/O; step 5 may involve a write I/O -- but you are not waiting for it to finish.
Index updates
UNIQUE indexes must be checked before finishing an INSERT. This involves a potentially-cached read I/O.
For a non-unique index, an entry in the "Change buffer" is made. (This lives in the buffer_pool.) Eventually that is merged with the appropriate block on disk. That is, no waiting for I/O when INSERTing a row (at least not waiting to update non-unique indexes).
Corollary: UNIQUE indexes are more costly. But is there really any need for more than 2 such indexes (including the PK)?

Ceph storage usable space calculation

Can someone help me with the question below?
How can I calculate the total usable Ceph storage space?
Let's say I have 3 nodes and each node has 6 OSDs with a 1 TB disk each. That is a total of 18 TB of storage (3 * 6 TB). Is all of this 18 TB usable, or will some space go to redundancy?
Ceph has two important values: full and near-full ratios. Default for full is 95% and nearfull is 85%. (http://docs.ceph.com/docs/jewel/rados/configuration/mon-config-ref/)
If any OSD hits the full ratio, it will stop accepting new write requests (read: your cluster gets stuck). You can raise this value, but be careful, because if an OSD stops because there is no space left (at the FS level), you may experience data loss.
That means you can't get more than the full ratio out of your cluster, and for normal operation it's wise not to reach the near-full value.
For your case, with replication factor 3, you have 6 * 3 TB of raw space; this translates to 6 TB of protected space, and after multiplying by 0.85 you have 5.1 TB of normally usable space.
Two more pieces of unsolicited advice: use at least 4 nodes (3 is the bare minimum to work; if one node is down, you are in trouble), and use a lower value for near-full. I'd advise keeping it around 0.7. In that case you will have (4 nodes, 6 * 1 TB OSDs, / 3, * 0.7) 5.6 TB of usable space.
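The same arithmetic as a small sketch; the replica count and the fill ratios are the values discussed above:

def usable_tb(nodes: int, osds_per_node: int, osd_tb: float,
              replicas: int = 3, fill_target: float = 0.85) -> float:
    # Usable capacity: raw space divided by the replica count, scaled by the target fill ratio.
    raw = nodes * osds_per_node * osd_tb
    protected = raw / replicas
    return protected * fill_target

print(usable_tb(3, 6, 1.0))                   # ~5.1 TB, as in the answer
print(usable_tb(4, 6, 1.0, fill_target=0.7))  # ~5.6 TB with 4 nodes and a 0.7 target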

Best DataType for my Table in SQLite

I am at a loss over the datatype selection for the table below in SQLite. The columns highlighted in yellow are user inputs, while the rest are calculated fields. (The screenshot displays a spreadsheet, not an SQLite table!)
From what I have learned, datatype selection (to a measurable extent) influences the values of calculated fields.
For instance, I create an SQLite table ELZ_A where all the input fields sit, beside DATE, towards the left and all the calculated fields towards the right. I then add triggers:
UPDATE ELZ_A SET CURRENT_DENSITY = ROUND((LOAD / 2.721), 2);
UPDATE ELZ_A SET VOLTS_AVG = ROUND((VOLTS_T / ELEMENTS), 2);
UPDATE ELZ_A SET VOLTS_STNDR = ROUND(2.4 + ((12.75 / 2.721) * ((VOLTS_AVG - 2.4) / CURRENT_DENSITY) - ((90 - CATHOLYTE_TEMP) * 0.01) + ((32 - CATHOLYTE_CONC) * 0.02)), 4);
UPDATE ELZ_A SET KF_FACTOR = ROUND(((VOLTS_AVG - (90 - CATHOLYTE_TEMP) * 0.016 * ((LOAD / 2.721) / 5)) + ((32 - CATHOLYTE_CONC) * 0.033 * ((LOAD / 2.721) / 5) - 2.4)) / (LOAD / 2.721), 4);
As one can see, the calculated field VOLTS_STNDR is further used to calculate KF_FACTOR; also, I am rounding the first two results to 2 places and the latter to 4... So how can I make sure I am using the datatype best suited to the displayed data, to get calculated answers as precise as what Excel calculates?
Problem explained further (SQLite Studio v3.0.6, Debian x32)
All of my columns contain decimals except ELEMENTS (which is a whole number), so I am using ... The columns up to EFFICIENCY are user inputs; the ones after it are calculated columns.
Now I am having this problem:
As one can observe, wherever EFFICIENCY is a whole number the production column becomes zero. Why does this happen? The problem is mitigated by tuning datatypes such as
...but this is even more confusing, as I cannot work out the actual cause.
SQLite uses dynamic typing, so what data type you declare does not matter for the DB itself.
In Excel, numbers are always stored with maximum precision; the formatting settings influence only how they are displayed.
So to get maximum accuracy, drop all ROUND() calls, and change the code that displays the data.
And calculated fields are best implemented with a view.
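A minimal sketch of that view idea, run through Python's sqlite3 module; the table layout and column names here are simplified assumptions for illustration, not the poster's actual schema:

import sqlite3

con = sqlite3.connect(":memory:")
# Assumed, cut-down schema -- only a few of the input columns from the post.
con.execute("CREATE TABLE ELZ_A (LOAD_KA REAL, VOLTS_T REAL, ELEMENTS INTEGER)")

# Calculated fields live in a view; the base table keeps the unrounded user input.
con.execute("""
    CREATE VIEW ELZ_A_CALC AS
    SELECT LOAD_KA,
           VOLTS_T,
           ELEMENTS,
           ROUND(LOAD_KA / 2.721, 2)          AS CURRENT_DENSITY,
           ROUND(VOLTS_T * 1.0 / ELEMENTS, 2) AS VOLTS_AVG  -- * 1.0 guards against integer division
    FROM ELZ_A
""")

con.execute("INSERT INTO ELZ_A VALUES (12.5, 240.0, 100)")
print(con.execute("SELECT * FROM ELZ_A_CALC").fetchone())
# (12.5, 240.0, 100, 4.59, 2.4) -- rounding happens only at query/display time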

HashTable problems Complexity implementation

I coded a Java implementation of a hash table, and I want to test its complexity. The hash table is structured as an array of doubly linked lists (also implemented by me). The size of the array is m. I implemented a division hashing function, a multiplication one, and a universal one. For now I am testing the first one.
I've set up a testing suite this way:
U (maximum value for a key) = 10000;
m (number of position in the hashkey) = 709;
n (number of elements to be inserted) = variable.
So I ran multiple insert tests, gradually inserting arrays with different n. I measured the execution time with System.nanoTime().
The graph that comes out is the following:
http://imgur.com/AVpKKZu
Assuming that an insert is O(1), n inserts are O(n). So should this graph look like O(n)?
If I change my values like this:
U = 1000000
m = 1009
n = variable (I inserted, one run at a time, arrays whose size grows in steps of 25000 elements, from one with 25000 elements up to one with 800000 elements).
The graph I got looks a little strange:
http://imgur.com/l8OcQYJ
The unique keys of the elements to be inserted are chosen pseudo-randomly from the key universe U.
But across different executions, even if I store the same keys in a file, the behaviour of the graph always changes, showing some peaks.
I hope you can help me. If someone needs the code, leave a comment and I will be pleased to share it.
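For reference, here is a rough sketch (in Python rather than Java, and with plain lists standing in for the doubly linked lists) of the kind of measurement described above: insert n pseudo-random keys from the universe U into a table of m chains and time it. Repeating each measurement and keeping the best time tends to damp the one-off peaks caused by the runtime (garbage collection, scheduler noise):

import random
import time

def bucket_index(key: int, m: int) -> int:
    # Division-method hash: h(k) = k mod m.
    return key % m

def insert_all(keys, m):
    # Insert every key into a fresh table of m chains.
    table = [[] for _ in range(m)]
    for k in keys:
        table[bucket_index(k, m)].append(k)
    return table

def time_inserts(keys, m, repeats=5):
    # Best-of-`repeats` wall-clock time; repetition damps one-off spikes.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        insert_all(keys, m)
        best = min(best, time.perf_counter() - t0)
    return best

U, m = 1_000_000, 1009
for n in range(25_000, 800_001, 25_000):
    keys = random.sample(range(U), n)  # n distinct pseudo-random keys from the universe
    print(n, time_inserts(keys, m))    # should grow roughly linearly in n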

Number of movements in a dynamic array

A dynamic array is an array that doubles its size when an element is added to an already full array, copying the existing elements to a new location (more details here). It is clear that there will be ceil(log(n)) bulk copy operations.
In a textbook I have seen the number of movements M as being computed this way:
M = sum_{i=1}^{ceil(log n)} i * n / 2^i, with the argument that "half the elements move once, a quarter of the elements twice"...
But I thought that for each bulk copy operation the number of copied/moved elements is actually n/2^i, as every bulk operation is triggered by reaching and exceeding the 2^i-th element, so that the number of movements is
M = sum_{i=1}^{ceil(log n)} n / 2^i (for n = 8 this seems to be the correct formula).
Who is right, and what is wrong in the other argument?
Both versions are O(n), so there is no big difference.
The textbook version counts the initial write of each element as a move operation but doesn't consider the very first element, which will move ceil(log(n)) times. Other than that they are equivalent, i.e.
(sum_{i=1}^{ceil(log n)} i * n / 2^i) - (n - 1) + ceil(log n)
== sum_{i=1}^{ceil(log n)} n / 2^i
when n is a power of 2. Both are off by different amounts when n is not a power of 2.
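A small simulation makes the bookkeeping easy to check: it counts only the elements copied during resizes (assuming the array starts at capacity 1 and doubles when full) and compares the result with both formulas:

import math

def copies_simulated(n: int) -> int:
    # Append n elements to a doubling array and count elements copied at resizes.
    capacity, size, moved = 1, 0, 0
    for _ in range(n):
        if size == capacity:   # full: allocate double the space and copy everything over
            moved += size
            capacity *= 2
        size += 1
    return moved

def copies_simple(n: int) -> int:
    # sum_{i=1..ceil(log2 n)} n / 2^i  (copy sizes n/2, n/4, ..., 1)
    k = math.ceil(math.log2(n))
    return sum(n // 2**i for i in range(1, k + 1))

def copies_textbook(n: int) -> int:
    # sum_{i=1..ceil(log2 n)} i * n / 2^i  (the textbook count, including initial writes)
    k = math.ceil(math.log2(n))
    return sum(i * n // 2**i for i in range(1, k + 1))

n = 8
print(copies_simulated(n))                                      # 7
print(copies_simple(n))                                         # 7
print(copies_textbook(n) - (n - 1) + math.ceil(math.log2(n)))   # 7, matching the identity above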
