Datablocksize and Freespacepercent for Spools - teradata

Do table properties like Datablocksize and Freespacepercent also apply to spools?
If so, and we have a table A with Datablocksize x1 and Freespacepercent x2, and a table B with Datablocksize y1 and Freespacepercent y2,
what would be the Datablocksize and Freespacepercent of the spool created by joining the two tables?
Note: the Freespacepercent property doesn't really make sense for a spool, though.

DatablockSize and FreespacePercent are table-only options.
As you said Freespace would be totally useless for a spool, because it's temporary.
And spools always have a maximum blocksize of 255 sectors = 127.5 KB (1 MB in TD14.10).
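For reference, both are specified as table-level options in the DDL. A minimal sketch (the database, table, column names, and the specific values here are made up for illustration):

CREATE TABLE sandbox.block_demo ,
    DATABLOCKSIZE = 130560 BYTES,   -- 255 sectors
    FREESPACE = 10 PERCENT
(
    id  INTEGER,
    txt VARCHAR(100)
)
PRIMARY INDEX (id);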

Related

Why do DynamoDB imported size and actual table size differ so much?

I have imported a DynamoDB table from S3. Here are the dataset sizes at each step:
Compressed dataset in S3 (DynamoDB JSON format with GZIP compression) = 13.3GB
Imported table size according to the DynamoDB imports page (uncompressed) = 64.4GB
Imported item count = 376 736 126
Current table size according to the DynamoDB tables page (compressed?) = 41.5GB (less than at the import time!)
Item count = 380 528 674 (I have performed some insertions already)
Since the import, the table has only grown.
What's the reason for the much smaller estimate of the actual table size? Is it because DynamoDB table sizes are approximations in general? Or does DynamoDB apply some compression to the stored data?
The source S3 dataset should not have any duplicates: it is built using Athena by running a GROUP BY query on the DynamoDB table's key. So, I do not expect it to be a cause.
Each item has 4 attributes: the PK is 2 long strings (blockchain addresses = 40 hex chars + ≈2–6 extra characters), plus there is 1 long string (a uint256 balance as a hex string, ≤ 64 characters) and 1 numeric value. The table import format is DynamoDB JSON.
DynamoDB performs no compression.
The likely cause is how you are calculating the size of the table and the number of items. If you are relying on the table's metadata, it is only updated every 6 hours and is an approximate value, which should not be relied upon for comparisons or validity checks.
The reason is the DynamoDB JSON format's overhead. I am the author of the question, so an exact example should provide more clarity. Here is a random item of mine:
{"Item":{"w":{"S":"65b88d0d0a1223eb96bccae06317a3155bc7e391"},"sk":{"S":"43ac8b882b7e06dde7c269f4f8aaadd5801bd974_"},"b":{"S":"6124fee993bc0000"},"n":{"N":"12661588"}}}
When importing from S3, the DynamoDB import functionality bills per the total uncompressed size read, which for this item comes to 169 bytes (168 characters + newline).
However, when stored in DynamoDB, the item only occupies the capacity of its fields (see the DynamoDB docs):
The size of a string is (length of attribute name) + (number of UTF-8-encoded bytes).
The size of a number is approximately (length of attribute name) + (1 byte per two significant digits) + (1 byte).
For this specific item the DynamoDB's native size estimation is:
w (string) = 1 + 40 chars
sk (string) = 2 + 41 chars
b (string) = 1 + 16 chars
n (number) = 1 + (8 significant digits / 2 = 4) + 1
The total is 107 bytes. DynamoDB's current estimate for this table is 108.95 bytes per item on average, which is pretty close (some field values vary in length; this particular example is nearly the shortest possible).
This amounts to roughly a 100% - 108.95/169 ≈ 35% size reduction when the data is actually stored in DynamoDB compared to the imported size, which closely matches the numbers reported in the question: 64.4 GB * 108.95 / 169 ≈ 41.5 GB.

SQLite RANDOM() function in CTE

I found behavior of the RANDOM() function in SQLite which doesn't seem correct.
I want to generate random groups using RANDOM() and CASE. However, it looks like the CTE is not behaving correctly.
First, let's create a table
DROP TABLE IF EXISTS tt10ROWS;
CREATE TEMP TABLE tt10ROWS (
some_int INTEGER);
INSERT INTO tt10ROWS VALUES
(1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
SELECT * FROM tt10ROWS;
Incorrect behaviour
WITH
-- 2.a add columns with random number and save in CTE
STEP_01 AS (
SELECT
*,
ABS(RANDOM()) % 4 + 1 AS RAND_1_TO_4
FROM tt10ROWS)
-- 2.b - get random group
select
*,
CASE
WHEN RAND_1_TO_4 = 1 THEN 'GROUP_01'
WHEN RAND_1_TO_4 = 2 THEN 'GROUP_02'
WHEN RAND_1_TO_4 = 3 THEN 'GROUP_03'
WHEN RAND_1_TO_4 = 4 THEN 'GROUP_04'
END AS GROUP_IT
from STEP_01;
With this query we get correct values in the RAND_1_TO_4 column, but the GROUP_IT column is incorrect: the groups don't match the numbers, and some rows get no group at all.
Correct behaviour
I found a workaround for this problem by creating a temporary table instead of using a CTE. It helped.
-- 1.a - add column with random number 1-4 and save as TEMP TABLE
drop table if exists ttSTEP01;
CREATE TEMP TABLE ttSTEP01 AS
SELECT
*,
ABS(RANDOM()) % 4 + 1 AS RAND_1_TO_4
FROM tt10ROWS;
-- 1.b - get random group
select
*,
CASE
WHEN RAND_1_TO_4 = 1 THEN 'GROUP_01'
WHEN RAND_1_TO_4 = 2 THEN 'GROUP_02'
WHEN RAND_1_TO_4 = 3 THEN 'GROUP_03'
WHEN RAND_1_TO_4 = 4 THEN 'GROUP_04'
END AS GROUP_IT
from ttSTEP01;
QUESTION
What is the reason behind this behaviour, where the GROUP_IT column is not generated properly?
If you look at the bytecode generated by the incorrect query using EXPLAIN, you'll see that every time the RAND_1_TO_4 column is referenced, its value is re-calculated and a new random number is used (I suspect, but am not 100% sure, that this has something to do with random() being a non-deterministic function). The NULL values appear when none of the CASE tests end up being true.
When you insert into a temporary table and then use that for the rest, the values of course remain static and it works as expected.
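As a side note: on SQLite 3.35.0 or newer you can keep the CTE and force it to be evaluated once by adding the MATERIALIZED keyword, which should pin the random values the same way the temp table does. A sketch of the original query with just that hint added:

WITH
STEP_01 AS MATERIALIZED (
    SELECT
        *,
        ABS(RANDOM()) % 4 + 1 AS RAND_1_TO_4
    FROM tt10ROWS)
SELECT
    *,
    CASE
        WHEN RAND_1_TO_4 = 1 THEN 'GROUP_01'
        WHEN RAND_1_TO_4 = 2 THEN 'GROUP_02'
        WHEN RAND_1_TO_4 = 3 THEN 'GROUP_03'
        WHEN RAND_1_TO_4 = 4 THEN 'GROUP_04'
    END AS GROUP_IT
FROM STEP_01;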

SQLite: How to reduce byte size of integer values?

I have a SQLite table (without row ID, but that's probably irrelevant, and without any indexes) where my rows contain the following data:
2 real values, one of which is the primary key
3 integers < 100
1 more field for integers, but currently always null
According to http://www.sqlite.org/datatype3.html, integer values can take 1, 2, 3, 4, 6 or 8 bytes according to their magnitude. Therefore I'd expect each row in my table to take up about 20 bytes. In reality, sqlite3_analyzer gives me for the table
Average payload per entry......................... 25.65
Maximum payload per entry......................... 26
which is somewhere in between the minimum value of 20 and the maximum of 32 (if all integers were stored with 4 bytes). Is it possible to give the DB a "hint" to use even smaller integer types wherever possible? Or how else can the discrepancy be explained? (I don't think it's indexes because there are none for this table.)
Similarly, on a previous table I had 2 real values + 2 small integers and each entry occupied slightly more than 24 bytes (which is also more than I would have expected).
Also, there is no way to store floats in single precision with SQLite, right?
The actual record format has one integer for the header size, one integer for each column to describe the value's type, and all the data of the column values.
In this case, we have:
bytes
    1   header size
    6   column types
   16   two real values
    3   three small integers between 2 and 127
    0   NULL
   --
   26
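If you want to check those numbers yourself and your SQLite build includes the dbstat virtual table (the stock sqlite3 shell usually does), you can approximate sqlite3_analyzer's "average payload per entry" like this (a sketch; 'mytable' is a placeholder for your table name):

-- approximate average payload bytes per row, counting only leaf pages of the table's b-tree
SELECT SUM(payload) * 1.0 / SUM(ncell) AS avg_payload_per_entry
FROM dbstat
WHERE name = 'mytable'
  AND pagetype = 'leaf';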

Optimization of Oracle Query

I have the following query:
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1, A2,A3,A4,A5,A6,COUNT(A7) AS total,A8
FROM (
select a.* from table1 a
left join (select * from table_reject where name = 'smith') b on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
GROUP BY
CUBE(A1, A2,A3,A4,A5,A6,A8)
)INN
WHERE
INN.A1 IS NOT NULL AND
INN.A2 IS NULL AND
INN.A3 IS NULL AND
INN.A4 IS NULL AND
INN.A5 IS NULL AND
INN.A6 is NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
The total number of records in table1 is around 8 million.
My problem is that I need to optimize the above query in the best possible way. I tried creating an index on column A8 of table1; it did decrease the cost of the query, but the execution time is more or less the same as when there was no index on table1.
Any help would be appreciated.
Thanks
A CUBE operation on a large data set is really expensive, so you need to check whether you really need all that data in the inner query. I see you are doing a COUNT in the inner query and then a SUM of those counts in the outer query; in other words, "give me the row count of A7 for every combination of A1-A8 (excluding A7), then SUM only the combinations selected by the WHERE clause." We can certainly optimize this by limiting the CUBE to certain columns, but the obvious things I have noticed so far are as follows.
If you use the query below and have the right indexes on Table1 and Table_reject, then both sides of the join can utilize those indexes and reduce the data set that needs to be joined and further processed.
I am not 100% sure, but partial CUBE processing is possible and needs to be checked.
Clustered index --> Table1 needs one on A8, and Table_Reject needs a clustered index on NAME.
Non-clustered index --> Table1 needs one on (A3, A9), and Table_reject needs one on (B3, B2).
SELECT qry1.*
FROM
(
SELECT A1, A2,A3,A4,A5,A6,A7,A8
FROM table1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
)qry1
LEFT JOIN
(
select B3,B2,ID
from table_reject
where name = 'smith'
)qry2
ON qry1.A3 = qry2.B3 and qry1.A9=qry2.B2
WHERE qry2.ID IS NULL
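Since the question is about Oracle, which doesn't use the clustered/non-clustered wording, the plain B-tree equivalents of the suggested indexes might look roughly like this (a sketch; the index names are made up):

CREATE INDEX ix_table1_a8     ON table1 (A8);
CREATE INDEX ix_table1_a3_a9  ON table1 (A3, A9);
CREATE INDEX ix_reject_name   ON table_reject (NAME);
CREATE INDEX ix_reject_b3_b2  ON table_reject (B3, B2);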
EDIT1:
I tried to find out what the difference in the CUBE operator's result would be if you run it on all columns versus only the columns you need in the result set. What I found is that, the way the CUBE function works, you do not need to perform CUBE on all columns, because in the end you only care about the combinations generated by CUBE where A1 and A8 are NOT NULL.
Try this link and see the output.
Query1 and Query2 are just the innermost queries, to compare the CUBE result sets.
Query3 and Query4 are the same query you are trying, and you can see the results are the same in both cases.
DECLARE #NEXT_DAY DATE = NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1,COUNT(A7) AS total,A8
FROM (
select a.a1,a.a7,a.a8
from table1 a
left join (select * from table_reject where name = 'smith') b
on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= #NEXT_DAY
GROUP BY
CUBE(A1,A8)
)INN
WHERE INN.A1 IS NOT NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
EDIT3
As I mentioned in the comment, this is a Round3 update; I can not edit the comment, but I meant Edit3 instead of Round3.
The new change in your query is adding the A8 >= #NEXT_DAY condition to the innermost left-join select (WHERE A8 >= #NEXT_DAY AND B.ID IS NULL as well). That has improved the selection very much.
In your last comment you mentioned that the query takes 30-35 seconds and that the time keeps increasing as you change the value of A8. You didn't mention how much data is in the result set, though, and that matters: if the query returns 5M rows as its final result set, it will spend 90% of its time just writing that data to the UI or output file, whatever output method you are using. Actual performance should be measured by how soon the query starts returning the first few rows, because by that point the optimizer has already decided on the execution plan and the DB is executing it. That said, I agree that if a query returns 100 rows and takes 10 seconds, something may be wrong with the execution plan.
To demo that, I created dummy data and ran your query against it.
I have a table Test_CubeData with 9M rows, with the same number of columns and data types you described for your Table1. I have a second table, Table_Reject, with 80K rows, whose column count and data types I figured out from the query. To test the extreme side of this table, the name column has only the single value "smith" and ID is NULL for all 80K rows, so the only column values that can affect the inner left-join result are B2 and B3.
In these tests I do not have any index on either table; both are heaps. You can see the results still come back within a few seconds for an acceptable amount of data in the result set; as the result set grows, the completion time increases. If I create the indexes described above, I get an Index Seek operation for all these tested cases, but at a certain point that index will be exhausted too and become an Index Scan.
One sure example: if my filter value for the A8 column is the smallest date that exists in that column, the optimizer will see that all 9M rows need to participate in the inner select and the CUBE, and a lot of data will be processed in memory, which is expected. On the other hand, consider another set of queries: I have 32873 unique values in the A8 column, distributed almost equally among the 9M rows, so there are 260 to 300 rows per A8 value. If I execute the query for any single value (smallest, largest, or anything in between), the query execution time should not change.
Notice the highlighted text in each image below, which indicates: the chosen value of the A8 filter, only the important columns in the select list instead of *, the A8 filter added to the inner left-join query, the execution plan showing a TableScan operation on both tables, the query execution time in seconds, and the total number of rows returned by the query.
I hope that this will clear some doubts on performance of your query and will help you to set right expectation.
(Screenshots: Table Row Counts; TableScan_InnerLeftJoin; TableScan_FullQuery_248Rows; TableScan_FullQuery_5K; TableScan_FullQuery_56K; TableScan_FullQuery_480k)
You're calculating a potentially very large cube result on seven columns, and then discarding all the results except those that are logically just a group_by on column A1.
I suggest that you rewrite the query to just group by A1.
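A sketch of that rewrite, assuming the column meanings from the original query and that none of the CUBE columns contain real NULLs the outer filters were relying on:

SELECT a.A1, COUNT(a.A7) AS sum_total
FROM table1 a
LEFT JOIN table_reject b
       ON a.A3 = b.B3
      AND a.A9 = b.B2
      AND b.name = 'smith'
WHERE b.ID IS NULL
  AND a.A8 >= NEXT_DAY(TRUNC(TO_DATE('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')), 'SUN')
GROUP BY a.A1
ORDER BY sum_total DESC;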

Group by ranges in SQLite

I have a SQLite table which contains a numeric field field_name. I need to group by ranges of this column, something like this: SELECT CAST(field_name/100 AS INT), COUNT(*) FROM table GROUP BY CAST(field_name/100 AS INT), but including ranges which have no values (their COUNT should be 0). I can't figure out how to write such a query.
You can do this by using a join and (though kludgy) an extra table.
The extra table would contain each of the values you want a row for in the response to your query (this would not only fill in the missing CAST(field_name/100 AS INT) values between your returned values, but also let you expand the range: if your current groups were 5, 6, 7, you could include 0 through 10).
In other flavors of SQL you'd be able to right join or full outer join, and you'd be on your way. Alas, SQLite (prior to version 3.39.0) doesn't offer these.
Accordingly, we'll use a cross join (join everything to everything) and then filter. If you've got a relatively small database or a small number of groups, you're in good shape. If you have large numbers of both, this will be a very intensive way to go about this (the cross join result will have #ofRowsOfData * #ofGroups rows, so watch out).
Example:
TABLE: groups_for_report
desired_group
-------------
0
1
2
3
4
5
6
Table: data
fieldname other_field
--------- -----------
250 somestuff
230 someotherstuff
600 stuff
you would use a query like
select groups_for_report.desired_group, count(data.fieldname)
from data
cross join groups_for_report
where CAST(fieldname/100.0 AS INT)=desired_group
group by desired_group;
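Note that the cross join plus WHERE drops any group with no matching data rows entirely, rather than reporting it with a count of 0. If you want the empty groups to show up, you can move the condition into a LEFT JOIN from the groups table instead (a sketch against the same example tables):

select g.desired_group, count(d.fieldname)
from groups_for_report g
left join data d
  on CAST(d.fieldname/100.0 AS INT) = g.desired_group
group by g.desired_group;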
