I want to report on various statistics on a specific Teradata database, particularly "free space". Should table skew be included in the calculation? For example, someone suggested the following query:
SELECT databasename
, SUM(maxperm)/1024/1024/1024 (DECIMAL(10,2)) AS space_allocated
, SUM(currentperm)/1024/1024/1024 (DECIMAL(10,2)) AS space_Used
, (MAX(currentperm)*COUNT(*)-SUM(currentperm))
/1024/1024/1024 (DECIMAL(10, 2)) AS skew_Size
, (space_used + skew_size) AS total_space_used
, (MIN(maxperm-currentperm)/1024/1024/1024) * COUNT(*) (DECIMAL(10,2))
AS free_Space
, CAST(total_space_used AS DECIMAL(10,2)) * 100
/ CAST(space_allocated AS DECIMAL(10,2)) AS pct_used
FROM DBC.diskspace
WHERE databasename = 'MyDatabase'
AND maxperm > 0
GROUP BY 1;
I'm particularly curious about the calculation of total_space_used and pct_used. Is it "proper" to account for skewed tables like this?
Definitely take skew into account. Your free_space column gives you exactly the free space with the skew of all existing tables already considered.
Note that free_space also assumes that all future tables will be perfectly distributed, i.e. without skew.
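To see where that number comes from, you can look at the space per AMP (a sketch, assuming the DBC.DiskSpaceV view is available in your release; swap in DBC.DiskSpace if not):
SELECT Vproc
, MaxPerm/1024/1024/1024 (DECIMAL(10,2)) AS amp_max_gb
, CurrentPerm/1024/1024/1024 (DECIMAL(10,2)) AS amp_used_gb
, (MaxPerm - CurrentPerm)/1024/1024/1024 (DECIMAL(10,2)) AS amp_free_gb
FROM DBC.DiskSpaceV
WHERE DatabaseName = 'MyDatabase'
ORDER BY amp_free_gb;
The smallest amp_free_gb times the number of AMPs is exactly the free_space your query computes: a perfectly distributed new table can only grow until the fullest AMP runs out of room.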
I am using the following query to obtain the current component serial number (tr_sim_sn) installed on the host device (tr_host_sn) from the most recent record in a transaction history table (PUB.tr_hist):
SELECT tr_sim_sn FROM PUB.tr_hist
WHERE tr_trnsactn_nbr = (SELECT max(tr_trnsactn_nbr)
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_lot = '99524136'
AND tr_part = '6684112-001')
The actual table has ~190 million records. The excerpt below contains only a few sample records, and only fields relevant to the search to illustrate the query above:
tr_sim_sn |tr_host_sn* |tr_host_pn |tr_domain |tr_trnsactn_nbr |tr_qty_loc
_______________|____________|_______________|___________|________________|___________
... |
356136072015140|99524135 |6684112-000 |vattal_us |178415271 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178424418 |-1.0000000000
356136072015458|99524136 |6684112-001 |vattal_us |178628048 |1.0000000000
356136072015050|99524136 |6684112-001 |vattal_us |178628051 |-1.0000000000
356136072015836|99524137 |6684112-005 |vattal_us |178645337 |-1.0000000000
...
* = key field
The excerpt illustrates multiple occurrences of tr_trnsactn_nbr for a single value of tr_host_sn. The largest value for tr_trnsactn_nbr corresponds to the current tr_sim_sn installed within tr_host_sn.
This query works, but it is very slow, ~8 minutes.
I would appreciate suggestions to improve or refactor this query to improve its speed.
Check with your admins to determine when they last updated the SQL statistics. If the answer is "we don't know" or "never" then you might want to ask them to run the following 4GL program, which will create a SQL script to accomplish that:
/* genUpdateSQL.p
*
* mpro dbName -p util/genUpdateSQL.p -param "tmp/updSQLstats.sql"
*
* sqlexp -user userName -password passWord -db dbName -S servicePort -infile tmp/updSQLstats.sql -outfile tmp/updSQLstats.log
*
*/
output to value( ( if session:parameter <> "" then session:parameter else "updSQLstats.sql" )).
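/* emit one statistics statement (plus a commit) per application table; _hidden = no skips the system tables */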
for each _file no-lock where _hidden = no:
put unformatted
"UPDATE TABLE STATISTICS AND INDEX STATISTICS AND ALL COLUMN STATISTICS FOR PUB."
'"' _file._file-name '"' ";"
skip
.
put unformatted "commit work;" skip.
end.
output close.
return.
This will generate a script that updates statistics for all tables and all indexes. You could edit the output to only update the tables and indexes that are part of this query if you want.
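For reference, the generated updSQLstats.sql will contain one block like this per table:
UPDATE TABLE STATISTICS AND INDEX STATISTICS AND ALL COLUMN STATISTICS FOR PUB."tr_hist";
commit work;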
Also, if the admins are nervous they could, of course, try this on a test db or a restored backup before implementing in a production environment.
I am posting this as a response to my own request for an improved query.
As it turns out, the following query includes two distinct changes that greatly improved its speed. One is to include the tr_domain search criterion in both the main and nested portions of the query. The second is to narrow the search by increasing the number of search criteria, which in the following are all included in the nested section:
SELECT tr_sim_sn
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_trnsactn_nbr IN (
SELECT MAX(tr_trnsactn_nbr)
FROM PUB.tr_hist
WHERE tr_domain = 'vattal_us'
AND tr_part = '6684112-001'
AND tr_lot = '99524136'
AND tr_type = 'ISS-WO'
AND tr_qty_loc < 0)
This syntax results in ~0.5 s response time. (Credit to my colleague, Daniel V.)
To be fair, this query uses criteria outside the parameters stated in the original post, making it difficult, if not impossible, for others to attempt a reasonable answer. This omission was not on purpose, of course, but rather due to my being fairly new to the fundamentals of good query design. This query is in part a result of learning that when too few, or non-indexed, fields are used as search criteria in a large table, it can help to narrow the search by increasing the number of search criteria. The original had 3 criteria; this one has 5.
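Since the gain came from lining the WHERE clause up with fields the engine can use efficiently, it may also be worth checking which indexes PUB.tr_hist actually has. A minimal ABL sketch, assuming metaschema access (the same _file tables used by the statistics program above):
for each _file no-lock where _file._file-name = "tr_hist",
    each _index of _file no-lock:
    display _index._index-name.
end.
Any field combination in the nested query that matches a leading subset of one of these indexes lets the server bracket the search instead of scanning the table.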
I'm making a flight tracking map that will need to pull live data from an SQLite DB. I'm currently just using the SQLite executable to navigate the DB and understand how to interact with it. Each aircraft is identified by a unique hex_ident. I want to get a list of all aircraft that have sent out a signal in the last minute, as a way of identifying which aircraft are actually active right now. I tried
select distinct hex_ident, parsed_time
from squitters
where parsed_time >= Datetime('now','-1 minute')
I expected a list of only 4 or 5 hex_idents, but I'm just getting a list of every entry (today's entries only), and some are outside the 1-minute bound. I'm new to SQL so I don't really know how to do this yet. Here's what each entry looks like; the table is called squitters.
{
"message_type":"MSG",
"transmission_type":8,
"session_id":"111",
"aircraft_id":"11111",
"hex_ident":"A1B4FE",
"flight_id":"111111",
"generated_date":"2021/02/12",
"generated_time":"14:50:42.403",
"logged_date":"2021/02/12",
"logged_time":"14:50:42.385",
"callsign":"",
"altitude":"",
"ground_speed":"",
"track":"",
"lat":"",
"lon":"",
"vertical_rate":"",
"squawk":"",
"alert":"",
"emergency":"",
"spi":"",
"is_on_ground":0,
"parsed_time":"2021-02-12T19:50:42.413746"
}
Any ideas?
You must remove the 'T' from the value of parsed_time, or apply datetime() to it as well, to make the comparison work. Your parsed_time is stored in ISO format with a 'T' between date and time, while datetime('now', '-1 minute') returns 'YYYY-MM-DD HH:MM:SS' with a space; since 'T' sorts after ' ', the plain string comparison matches every row from today:
where datetime(parsed_time) >= datetime('now', '-1 minute')
Note that the datetime() function does not take microseconds into account, so if you need 100% accuracy you must append them with concatenation:
where replace(parsed_time, 'T', ' ') >=
datetime('now', '-1 minute') || substr(parsed_time, instr(parsed_time, '.'))
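Putting it together with your original statement, and selecting only hex_ident so that DISTINCT collapses the rows per aircraft, a minimal version would be (this assumes parsed_time is recorded in UTC, since 'now' in SQLite is UTC by default):
select distinct hex_ident
from squitters
where datetime(parsed_time) >= datetime('now', '-1 minute');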
(Be kind, this is my first question and I did extensive research here and on the net beforehand. The question Oracle ROWID for Sqoop Split-By Column did not really solve this issue, as the original person asking resorted to using another column.)
I am using Sqoop to copy data from an Oracle 11 DB.
Unfortunately, some tables have no index and no primary key, only partitions (by date). These tables are very large, hundreds of millions if not billions of rows.
So far, I have decided to access the data in the source by explicitly addressing the partitions. That works well and speeds up the process nicely.
I need to do the splits by data that resides in each and every table, in order to avoid too many if-branches in my bash script. (We're talking some 200+ tables here.)
I noticed that a split across 8 tasks results in a very uneven spread of workload among the tasks. I considered using the Oracle ROWID to define the split.
To do this, I must define a boundary query. In a standard query 'select * from xyz' the ROWID is not part of the result set; therefore, it is not an option to let Sqoop derive the boundary query from --query.
Now, when I run the import with --split-by ROWID, I get the error
ERROR tool.ImportTool: Encountered IOException running import job:
java.io.IOException: Sqoop does not have the splitter for the given SQL
data type. Please use either different split column (argument --split-by)
or lower the number of mappers to 1. Unknown SQL data type: -8
Samples of ROWID values:
AAJXFWAKPAAOqqKAAA
AAJXFWAKPAAOqqKAA+
AAJXFWAKPAAOqqKAA/
A ROWID is static and unique once it is created for any row.
So I cast this funny datatype into something else in my boundary query:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:oracle:thin:@127.0.0.1:port:mydb --username $USER --P --m 8 \
  --split-by ROWID \
  --boundary-query "select cast(min(ROWID) as varchar(18)), cast(max(ROWID) as varchar(18)) from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')" \
  --query "select * from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD') and \$CONDITIONS" \
  --null-string '\\N' \
  --null-non-string '\\N'
But then I get ugly ROWIDs that are rejected by Oracle:
select * from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')
and ( ROWID >= 'AAJX6oAG聕聁AE聉N:' ) AND ( ROWID < 'AAJX6oAH⁖⁁AD䁔䀷' ) ,
Error Msg = ORA-01410: invalid ROWID
How can I resolve this properly?
I am a Linux embryo and have painfully chewed myself through the topics of bash shell scripting and Sqooping so far, but I would like to make better use of an evenly spread mapper-task workload - it would cut the sqoop time in half, I guess, saving some 5 to 8 hours.
TIA!
wahlium
You can try ROWNUM, but I think sqoop import does not work with pseudocolumns.
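One way to try it anyway, sketched below, is to expose ROWNUM as a real column inside --query so the splitter sees an ordinary number. Table and column names are taken from the question, and this is only a sketch: without a deterministic ORDER BY, Oracle is not guaranteed to assign ROWNUM identically in every mapper's query, so rows could be missed or duplicated.
sqoop import --connect jdbc:oracle:thin:@127.0.0.1:port:mydb --username $USER --P --m 8 \
  --query "select t.* from (select s.*, rownum as rn from table s where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')) t where \$CONDITIONS" \
  --split-by rn \
  --boundary-query "select 1, count(*) from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')"
Each mapper then receives a contiguous rn range, which spreads the workload evenly.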
I'm writing an SQLite select statement and want to pick out the first hit only that satisfy my criterion.
My problem is that I'm writing code inside a simulation framework that wraps my SQLite code before sending it to the database, and this wrapping already adds 'LIMIT 100' to the end of the code.
What I want to do:
SELECT x, y, z FROM myTable WHERE a = 0 ORDER BY y LIMIT 1
What happens when this simulation development framework has done its job:
SELECT x, y, z FROM myTable WHERE a = 0 ORDER BY y LIMIT 1 LIMIT 100
exec error near "LIMIT": syntax error
So my question is: how do I work around this limitation? Is there any way to still limit my results to a single hit, despite the statement ending in 'LIMIT 100'? I'm thinking of something like creating a temporary table, adding an index and filtering on that, but my knowledge is limited to simple database queries.
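One workaround along those lines, sketched here on the assumption that the framework simply appends LIMIT 100 to whatever statement you hand it, is to push the real query into a subquery; the inner LIMIT 1 does the actual limiting, and the appended LIMIT 100 only applies to the outer SELECT, which already returns at most one row:
SELECT * FROM (
    SELECT x, y, z FROM myTable WHERE a = 0 ORDER BY y LIMIT 1
)
After the framework's rewrite this becomes "... ) LIMIT 100", which is valid SQLite and still returns a single row.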
I need the desired result with the least execution time.
I have a table which contains many rows (over 100k); in this table there is a field notes varchar2(1800).
It contains values like the following:
notes
CASE Transfer
Surnames AAA : BBBB
Case Status ACCOUNT TXFERRED TO BORROWERS
Completed Date 25/09/2022
Task Group 16
Message sent at 12/10/2012 11:11:21
Sender : lynxfailures123#google.com
Recipient : LFRB568767#yahoo.com
Received : 21:31 12/12/2002
Rows should be returned that contain the value (ACCOUNT TXFERRED TO BORROWERS).
I have used the following queries, but they take a long time (72150436 sec) to execute:
Select * from cps_case_history where (dbms_lob.instr(notes, 'ACCOUNT TFR TO UFSS') > 1)
Select * from cps_case_history where notes like '%ACCOUNT TFR TO UFSS%'
Could you please share a query that will take less time to execute?
You can try parallel hints (see Optimizer hints):
Select /*+ PARALLEL(a,8) */ a.* from cps_case_history a
where INSTR(NOTES,'Text you want to search') > 0; -- your condition
Replace 8 with 16 and see if the performance improves further.
Avoid % at the beginning of the LIKE pattern, i.e., where notes like '%Account...'.
Updated answer: try creating partitioned tables. You can go with range partitioning on the completed_date column (see Partitioning).
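A sketch of what that could look like, assuming cps_case_history has a DATE column named completed_date (all names here are illustrative, and interval partitioning requires 11g or later):
CREATE TABLE cps_case_history_part
PARTITION BY RANGE (completed_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2022-01-01')
)
AS SELECT * FROM cps_case_history;
Note that partitioning only helps this particular search if your queries also filter on completed_date, so that Oracle can prune partitions instead of scanning the whole table.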