Optimization of Oracle Query - oracle11g

I have the following query:
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1, A2,A3,A4,A5,A6,COUNT(A7) AS total,A8
FROM (
select a.* from table1 a
left join (select * from table_reject where name = 'smith') b on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
GROUP BY
CUBE(A1, A2,A3,A4,A5,A6,A8)
)INN
WHERE
INN.A1 IS NOT NULL AND
INN.A2 IS NULL AND
INN.A3 IS NULL AND
INN.A4 IS NULL AND
INN.A5 IS NULL AND
INN.A6 is NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
The total number of records in table1 is around 8 million.
My problem is that I need to optimize the above query as far as possible. I tried creating an index on column A8 of table1; that decreased the cost of the query, but the execution time is more or less the same as when there was no index on table1.
Any help would be appreciated.
Thanks

A CUBE operation on a large data set is really expensive, so check whether you really need all that data in the inner query. The inner query does a COUNT and the outer query then does a SUM of those counts. In other words: give me the count of A7 for every combination of A1-A8 (excluding A7), then SUM only the combinations selected by the WHERE clause. We can certainly optimize this by limiting the CUBE to certain columns, but the most obvious things I have noticed so far are as follows.
If you use the query below and have the right indexes on Table1 and Table_reject, both subqueries can use an index, which reduces the data set that needs to be joined and processed further.
I am not 100% sure, but partial CUBE processing should be possible; that needs checking.
Clustered index: Table1 needs one on A8, and Table_reject needs one on NAME.
Non-clustered index: Table1 needs one on (A3, A9), and Table_reject needs one on (B3, B2).
SELECT qry1.*
FROM
(
-- A9 is included because the join condition below uses qry1.A9
SELECT A1, A2,A3,A4,A5,A6,A7,A8,A9
FROM table1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
)qry1
LEFT JOIN
(
select B3,B2,ID
from table_reject
where name = 'smith'
)qry2
ON qry1.A3 = qry2.B3 and qry1.A9=qry2.B2
WHERE qry2.ID IS NULL
EDIT1:
I tried to find out how the CUBE result differs when you apply it to all columns versus only the columns you need in the result set. What I found is that, given the way CUBE works, you do not need to perform it on all columns, because in the end you only care about the combinations generated by CUBE where A1 and A8 are NOT NULL.
Try the linked example and compare the output.
Query1 and Query2 are just the innermost queries, to compare the CUBE result sets.
Query3 and Query4 are the same query that you are trying, and you can see the results are the same in both cases.
DECLARE #NEXT_DAY DATE = NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1,COUNT(A7) AS total,A8
FROM (
select a.a1,a.a7,a.a8
from table1 a
left join (select * from table_reject where name = 'smith') b
on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= #NEXT_DAY
GROUP BY
CUBE(A1,A8)
)INN
WHERE INN.A1 IS NOT NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
EDIT3
As I mentioned in the comment, this is a Round3 update. I cannot change the comment, but I meant Edit3 instead of Round3.
The new change in your query is adding the A8 >= #NEXT_DAY condition to the innermost left join select, that is, where A8 >= #NEXT_DAY AND B.ID is null. That has improved the selection very much.
In your last comment you mentioned that the query takes 30-35 seconds and that the time keeps increasing as you change the value of A8. But along with the execution time you didn't mention how much data is in the result set. Why is that important? Because if my query returns 5M rows as its final result set, it is going to spend 90% of its time just delivering that data to the UI, output file, or whatever output method you are using. Actual performance should be measured by how soon the query starts returning its first few rows, because by that time the optimizer has already decided on the execution plan and the DB is executing it. I agree, though, that if a query returns 100 rows and takes 10 seconds, something may be wrong with the execution plan.
To demonstrate that, I created dummy data and ran your query against it.
I have a table Test_CubeData with 9M rows, with the same number of columns and data types you described for your Table1. I have a second table, Table_Reject, with 80K rows; I worked out its columns and data types from the query. To test the extreme case for this table, the name column has only one value, "smith", and ID is null for all 80K rows, so the only column values that can affect the inner left join result are B2 and B3.
In these tests I do not have any index on either table; both are heaps. As you can see, the results still come back in a few seconds for an acceptable amount of data in the result set. As the result set grows, the completion time increases. If I create the indexes described above, I will get an Index Seek operation for all these test cases, but at a certain point that index will also be exhausted and the plan will become an Index Scan.
One sure example would be if my filter value for the A8 column is the smallest date value that exists in that column. In that case the optimizer will see that all 9M rows need to participate in the inner select and the CUBE, and a lot of data will be processed in memory, which is expected. On the other hand, let's look at another example. I have 32873 unique values in the A8 column, and those values are almost equally distributed among the 9M rows, so for each A8 value there are 260 to 300 rows. If I execute the query for any single value (smallest, largest, or anything in between), the query execution time should not change.
In each image below, notice the highlighted text that indicates: the A8 filter value chosen; only the important columns in the select list instead of *; the A8 filter added to the inner left join query; the execution plan showing the Table Scan operation on both tables; the query execution time in seconds; and the total number of rows returned by the query.
I hope this clears up some doubts about the performance of your query and helps you set the right expectations.
**Table Row Counts**
**TableScan_InnerLeftJoin**
**TableScan_FullQuery_248Rows**
**TableScan_FullQuery_5K**
**TableScan_FullQuery_56K**
**TableScan_FullQuery_480k**

You're calculating a potentially very large cube result on seven columns, and then discarding all the results except those that are logically just a GROUP BY on column A1.
I suggest that you rewrite the query to just group by A1.
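To make the claim above concrete, here is a minimal sketch in plain Python that simulates the CUBE on a handful of made-up rows and compares the filtered result with a plain GROUP BY on A1. The rows, the reduced column set (A1, A2, A8, A7) and the function names are assumptions for illustration only. It also assumes the grouping columns contain no NULLs in the data, since a data NULL would be indistinguishable from a CUBE super-aggregate NULL without GROUPING().

```python
from collections import Counter
from itertools import product

# Hypothetical rows: (A1, A2, A8, A7). A7 may be NULL (None).
rows = [
    ("x", "p", "d1", 1),
    ("x", "q", "d1", 2),
    ("x", "q", "d2", None),  # COUNT(A7) ignores this row's NULL
    ("y", "p", "d2", 3),
]

def cube_counts(rows):
    """Simulate SELECT A1, A2, A8, COUNT(A7) ... GROUP BY CUBE(A1, A2, A8)."""
    out = Counter()
    for a1, a2, a8, a7 in rows:
        # Every row contributes to all 2^3 grouping sets.
        for keep in product((True, False), repeat=3):
            key = tuple(v if k else None for v, k in zip((a1, a2, a8), keep))
            out[key] += 0 if a7 is None else 1
    return out

def via_cube(rows):
    """Outer query: keep A1 NOT NULL, A2 NULL, A8 NOT NULL, then SUM per A1."""
    totals = Counter()
    for (a1, a2, a8), n in cube_counts(rows).items():
        if a1 is not None and a2 is None and a8 is not None:
            totals[a1] += n
    return dict(totals)

def via_group_by(rows):
    """Equivalent of SELECT A1, COUNT(A7) ... GROUP BY A1."""
    totals = Counter()
    for a1, _a2, _a8, a7 in rows:
        totals[a1] += 0 if a7 is None else 1
    return dict(totals)

print(via_cube(rows), via_group_by(rows))
```

Both functions return the same per-A1 totals on this data, while the simulated CUBE does eight times the work per row; with seven columns the factor would be 128.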

Related

SQLite order results by smallest difference

In many ways this question follows on from my previous one. I have a table that is pretty much identical
CREATE TABLE IF NOT EXISTS test
(
id INTEGER PRIMARY KEY,
a INTEGER NOT NULL,
b INTEGER NOT NULL,
c INTEGER NOT NULL,
d INTEGER NOT NULL,
weather INTEGER NOT NULL);
in which I would typically have entries such as
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30100306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30140306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,5,5,10100306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,5,5,5,11100306);
INSERT INTO test (a,b,c,d,weather) VALUES(5,5,5,5,21101306);
Typically this table would have multiple rows with some or all of the b, c and d values being identical but with different a and weather values. As per the answer to my other question I can certainly issue
WITH cte AS (SELECT *, DENSE_RANK() OVER (ORDER BY (b=2) + (c=3) + (d=4) DESC) rn FROM test where a = 1) SELECT * FROM cte WHERE rn < 3;
No issues thus far. However, I have one further requirement which arises as a result of the weather column. Although this value is an integer it is in fact a composite where each digit represents a "banded" weather condition. Take for example weather = 20100306. Here 2 represents the wind direction divided up into 45 degree bands on the compass, 0 represents a wind speed range, 1 indicates precipitation as snow etc. What I need to do now while obtaining my ordered results is to allow for weather differences. Take for example the first two rows
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30100306);
INSERT INTO test (a,b,c,d,weather) VALUES(1,2,3,4,30140306);
Though otherwise similar, they represent rather different weather conditions: the fourth digit is 4 as opposed to 0, indicating a higher precipitation intensity band. The WITH cte... above would rank the first two rows at the top, which is fine. But what if I would rather have the row that differs the least from an incoming "weather condition" of 30130306? I would clearly like to have the second row appearing at the top. Once again, I can live with the "raw" result returned by WITH cte... and then drill down to the right row based on my current "weather condition" in Java. However, once again I find myself thinking that there is perhaps a rather neat way of doing this in SQL that is outwith my skill set. I'd be most obliged to anyone who might be able to tell me how/whether this can be done using just SQL.
You can sort the results 1st by DENSE_RANK() and 2nd by the absolute difference of weather and the incoming "weather condition":
WITH cte AS (
SELECT *,
DENSE_RANK() OVER (ORDER BY (b=2) + (c=3) + (d=4) DESC) rn
FROM test
WHERE a = 1
)
SELECT a,b,c,d,weather
FROM cte
WHERE rn < 3
ORDER BY rn, ABS(weather - ?);
Replace ? with the value of that incoming "weather condition".
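As a quick check of the query above, here is a minimal sketch using Python's sqlite3 module with the sample rows from the question. The helper name `closest` is hypothetical, and the window function assumes the bundled SQLite is version 3.25 or later. Note that a plain numeric difference weights the leading digits of the composite most heavily, so it is only a rough notion of "closest".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test (
        id INTEGER PRIMARY KEY,
        a INTEGER NOT NULL, b INTEGER NOT NULL,
        c INTEGER NOT NULL, d INTEGER NOT NULL,
        weather INTEGER NOT NULL)
""")
conn.executemany(
    "INSERT INTO test (a,b,c,d,weather) VALUES (?,?,?,?,?)",
    [
        (1, 2, 3, 4, 30100306),
        (1, 2, 3, 4, 30140306),
        (1, 2, 5, 5, 10100306),
        (1, 5, 5, 5, 11100306),
        (5, 5, 5, 5, 21101306),
    ],
)

def closest(conn, incoming):
    """Rank by matching b/c/d, then order ties by distance to `incoming`."""
    sql = """
        WITH cte AS (
            SELECT *,
                   DENSE_RANK() OVER (ORDER BY (b=2) + (c=3) + (d=4) DESC) rn
            FROM test
            WHERE a = 1
        )
        SELECT weather
        FROM cte
        WHERE rn < 3
        ORDER BY rn, ABS(weather - ?)
    """
    # The incoming weather value is bound to the ? placeholder.
    return [w for (w,) in conn.execute(sql, (incoming,))]

print(closest(conn, 30130306))
```

With an incoming value of 30130306, the row with weather 30140306 comes first because its difference (10000) is smaller than that of 30100306 (30000).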

How to get the total quantity of results using count(*)?

I need to get the total number of results for each person, but I get ...
My query:
select t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion as departamento_trabaja, t.fecha,count(*)
from fulltime.timbre t, fulltime.empleado e, fulltime.departamento d
where d.depa_id=e.depa_id and t.codigo_empleado=e.codigo_empleado and
trunc(t.fecha) between trunc(to_date('15/02/2017','dd/mm/yyyy')) and trunc(to_date('14/03/2017','dd/mm/yyyy'))
group by t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion, t.fecha
Expected data...
NOMBRE | APELLIDO | DEPARTAMENTO_TRABAJA | VECES_MARCADAS(count)
MARIA TARCILA IGLESIAS BECERRA ALCALDIA 4
KATHERINE TATIANA SEGOVIA FERNANDEZ ALCALDIA 10
FREDDY AGUSTIN VALDIVIESO VALLEJO ALCALDIA 3
UPDATE..
select e.nombre,e.apellido,d.descripcion as departamento_trabaja,COUNT(*)
from fulltime.timbre t, fulltime.empleado e, fulltime.departamento d
where d.depa_id=e.depa_id and t.codigo_empleado=e.codigo_empleado and
trunc(t.fecha) between trunc(to_date('15/02/2017','dd/mm/yyyy')) and trunc(to_date('14/03/2017','dd/mm/yyyy'))
group by t.fecha_hora_timbre,e.nombre,e.apellido,d.descripcion, t.fecha
You should only select and group by the non-aggregate columns you actually want to count against. At the moment you're including the fecha_hora_timbre and fecha columns in each row, so you're counting the unique combinations of those columns as well as the name/department information you actually want to count.
select e.nombre, e.apellido, d.descripcion as departamento_trabaja,
count(*) as veces_marcadas
from fulltime.timbre t
join fulltime.empleado e on t.codigo_empleado=e.codigo_empleado
join fulltime.departamento d on d.depa_id=e.depa_id
where t.fecha >= to_date('15/02/2017','dd/mm/yyyy')
and t.fecha < to_date('15/03/2017','dd/mm/yyyy')
group by e.nombre, e.apellido, d.descripcion
I've removed the extra columns. Notice that they have gone from both the select list and the group-by clause. If you have a non-aggregate column in the select list that isn't in the group-by you'll get an ORA-00979 error; but if you have a column in the group-by that isn't in the select list then it will still group by that even though you can't see it, and you just won't get the results you expect.
I've also changed from old-style join syntax to modern syntax. And I've changed the date comparison: firstly because applying trunc() in trunc(to_date('15/02/2017','dd/mm/yyyy')) is pointless, since you already know the time part is midnight; but mostly so that an index on fecha can be used if there is one. If you do trunc(t.fecha) then every column value has to be truncated, which stops the index being used (unless you have a function-based index). As between is inclusive, using >= and < with one day later as the upper limit should have the same effect overall.
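The effect of the extra group-by column and of the half-open date range can be sketched with Python's sqlite3 module. The two-table schema, the sample rows and the ISO-formatted text dates are simplifying assumptions; the original is Oracle with real DATE columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE empleado (codigo_empleado INTEGER PRIMARY KEY, nombre TEXT);
    CREATE TABLE timbre (codigo_empleado INTEGER, fecha TEXT);
    INSERT INTO empleado VALUES (1, 'MARIA'), (2, 'FREDDY');
    INSERT INTO timbre VALUES
        (1, '2017-02-15 08:00:00'),
        (1, '2017-02-15 17:00:00'),
        (1, '2017-03-14 23:30:00'),
        (2, '2017-02-20 08:05:00'),
        (2, '2017-03-15 00:00:00');  -- outside the half-open range
""")

# Half-open range on the raw column: no function wrapped around t.fecha.
date_filter = "t.fecha >= '2017-02-15' AND t.fecha < '2017-03-15'"

# Grouping only by the columns we want to count against.
per_person = conn.execute(f"""
    SELECT e.nombre, COUNT(*) AS veces_marcadas
    FROM timbre t
    JOIN empleado e ON t.codigo_empleado = e.codigo_empleado
    WHERE {date_filter}
    GROUP BY e.nombre
    ORDER BY e.nombre
""").fetchall()

# Leaving t.fecha in the GROUP BY silently splits each person into
# one group per timestamp, even though fecha is not selected.
over_grouped = conn.execute(f"""
    SELECT e.nombre, COUNT(*)
    FROM timbre t
    JOIN empleado e ON t.codigo_empleado = e.codigo_empleado
    WHERE {date_filter}
    GROUP BY e.nombre, t.fecha
""").fetchall()

print(per_person)
print(over_grouped)
```

The first query yields one row per person; the second yields one row per punch, each with a count of 1, which matches the symptom in the question.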

improve complex query in Teradata

I would like to know how to get all the rows from table1 that have a matching row in table3.
The structure of the tables is:
table1:
k1 k2
table2:
k1 k2 t1 t2 date type
table3:
t1 t2 date status
The conditions are:
k1 and k2 have to match with the corresponding columns in table2.
In table2 I will only check those rows where date='today' and type='a'.
That can return 0, 1 or many rows in table2.
Looking at t1 and t2 from table 2, I get the rows that match in table3.
If in table3 date='today' and status='ok', I will return the original row from table1, that is, k1 and k2.
How can I write this query (inner joins, exists, whatever), taking into account that the three tables have millions of rows, so it must be as efficient as possible?
I have a query, which is right for sure, but there are too many conditions for Teradata to come back with the answer. Too many joins, I think.
I would not consider three tables and a few million rows a complex query.
In Teradata you usually don't have to think that much about join/in/exists; all will be rewritten to joins internally. But there is a one-to-many-to-one relation here, so you should avoid a join, as this will need a final DISTINCT.
Better to use IN or EXISTS instead:
SELECT
K1,K2
FROM Table1
WHERE (K1,K2) IN
(
SELECT K1,K2
FROM Table2
WHERE datecol = CURRENT_DATE
AND typecol = 'a'
AND (T1,T2) IN
(
SELECT T1,T2
FROM Table3
WHERE datecol = CURRENT_DATE
AND status = 'ok'
)
)
Regarding the actual plan: if the necessary statistics exist, the optimizer should choose a good plan; check the confidence levels in Explain. You can also run diagnostic helpstats on for session; before running Explain to see if there are missing stats.
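The point about the one-to-many-to-one relation can be illustrated with a small sketch in Python's sqlite3 module (not Teradata). The sample rows are invented, a bound date parameter stands in for CURRENT_DATE, and the row-value IN syntax assumes SQLite 3.15 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (k1 INTEGER, k2 INTEGER);
    CREATE TABLE table2 (k1 INTEGER, k2 INTEGER, t1 INTEGER, t2 INTEGER,
                         datecol TEXT, typecol TEXT);
    CREATE TABLE table3 (t1 INTEGER, t2 INTEGER, datecol TEXT, status TEXT);
    INSERT INTO table1 VALUES (1, 1), (2, 2);
    -- (1,1) has two qualifying rows in table2; (2,2) has the wrong type.
    INSERT INTO table2 VALUES
        (1, 1, 10, 10, '2024-01-01', 'a'),
        (1, 1, 11, 11, '2024-01-01', 'a'),
        (2, 2, 12, 12, '2024-01-01', 'b');
    INSERT INTO table3 VALUES
        (10, 10, '2024-01-01', 'ok'),
        (11, 11, '2024-01-01', 'ok'),
        (12, 12, '2024-01-01', 'ok');
""")
today = "2024-01-01"  # stand-in for CURRENT_DATE

# Nested IN: each table1 row appears at most once.
in_rows = conn.execute("""
    SELECT k1, k2 FROM table1
    WHERE (k1, k2) IN (
        SELECT k1, k2 FROM table2
        WHERE datecol = ? AND typecol = 'a'
          AND (t1, t2) IN (
              SELECT t1, t2 FROM table3
              WHERE datecol = ? AND status = 'ok'))
""", (today, today)).fetchall()

# Plain joins: one output row per matching table2/table3 pair.
join_rows = conn.execute("""
    SELECT table1.k1, table1.k2
    FROM table1
    JOIN table2 ON table1.k1 = table2.k1 AND table1.k2 = table2.k2
               AND table2.datecol = ? AND table2.typecol = 'a'
    JOIN table3 ON table2.t1 = table3.t1 AND table2.t2 = table3.t2
               AND table3.datecol = ? AND table3.status = 'ok'
""", (today, today)).fetchall()

print(in_rows, join_rows)
```

The join version returns (1, 1) twice, which is why it would need a final DISTINCT, while the IN version returns it once.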
Something like the following should work.
SELECT
Table1.*
FROM
Table1
INNER JOIN Table2 ON
Table1.K1 = Table2.K1 AND
Table1.K2 = Table2.K2 AND
Table2.date = CURRENT_DATE and
Table2.type = 'a'
INNER JOIN Table3 ON
Table2.T1 = Table3.T1 AND
Table2.T2 = Table3.T2 AND
Table3.date = CURRENT_DATE and
Table3.status = 'ok'
Update:
Speaking more to the optimization part of the question. The execution steps that Teradata will most likely take here are:
In parallel it will select all records from Table1, Records from Table2 where the date is CURRENT_DATE and the type is a, and Records from Table3 where the date is CURRENT_DATE and the status is OK.
It will then join the results from the SELECT of Table2 to the results of the SELECT from table1.
It will then join the results from that to the results from the SELECT of table3.
You can get more information by putting EXPLAIN before your SELECT query. The results returned from the database will be an explanation of how your Teradata server will execute the query, which can be very enlightening when trying to optimize a big slow query.
Unfortunately the steps above are the best you can hope for. Parallel execution of all three tables with the filters applied, and then a join of the results. With big data, the slowest part of a query is often the join, so filtering before you get to that step is a big plus.
There's more that can be done to optimize like making sure your Indexes are in order and Collecting statistics, especially on fields where you will be filtering. But without the admin access to do that, your hands are tied.

Group by ranges in SQLite

I have a SQLite table which contains a numeric field field_name. I need to group by ranges of this column, something like this: SELECT CAST(field_name/100 AS INT), COUNT(*) FROM table GROUP BY CAST(field_name/100 AS INT), but including ranges which have no values (COUNT for them should be 0). How can I perform such a query?
You can do this by using a join and (though kludgy) an extra table.
The extra table would contain each of the values you want a row for in the response to your query. This would not only fill in missing CAST(field_name/100 AS INT) values between your returned values, but also let you expand the range, so that if your current groups were 5, 6, 7 you could include 0 through 10.
In other flavors of SQL you'd be able to right join or full outer join, and you'd be on your way. Alas, SQLite doesn't offer these (at least not in versions before 3.39).
Accordingly, we'll use a cross join (join everything to everything) and then filter. If you've got a relatively small database or a small number of groups, you're in good shape. If you have large numbers of both, this will be a very intensive way to go about this (the cross join result will have #ofRowsOfData * #ofGroups rows, so watch out).
Example:
TABLE: groups_for_report
desired_group
-------------
0
1
2
3
4
5
6
Table: data
fieldname other_field
--------- -----------
250 somestuff
230 someotherstuff
600 stuff
you would use a query like
select groups_for_report.desired_group, count(data.fieldname)
from data
cross join groups_for_report
where CAST(fieldname/100.0 AS INT)=desired_group
group by desired_group;
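One caveat worth checking: as written, the WHERE filter on the cross join drops every group that has no matching data row, so the empty ranges still do not show up with a count of 0. A minimal sketch using Python's sqlite3 module and the example tables above compares it with a LEFT JOIN that starts from the groups table, which SQLite does support. The LEFT JOIN variant is a suggested alternative, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE groups_for_report (desired_group INTEGER);
    CREATE TABLE data (fieldname INTEGER, other_field TEXT);
    INSERT INTO groups_for_report VALUES (0),(1),(2),(3),(4),(5),(6);
    INSERT INTO data VALUES
        (250, 'somestuff'), (230, 'someotherstuff'), (600, 'stuff');
""")

# The cross join plus WHERE from the answer: empty groups are filtered out.
cross = conn.execute("""
    SELECT groups_for_report.desired_group, COUNT(data.fieldname)
    FROM data
    CROSS JOIN groups_for_report
    WHERE CAST(fieldname/100.0 AS INT) = desired_group
    GROUP BY desired_group
    ORDER BY desired_group
""").fetchall()

# LEFT JOIN from the groups table: every group survives, and
# COUNT(d.fieldname) is 0 where no data row matched.
left = conn.execute("""
    SELECT g.desired_group, COUNT(d.fieldname)
    FROM groups_for_report g
    LEFT JOIN data d ON CAST(d.fieldname/100.0 AS INT) = g.desired_group
    GROUP BY g.desired_group
    ORDER BY g.desired_group
""").fetchall()

print(cross)
print(left)
```

The LEFT JOIN also avoids materializing the full rows-times-groups product that the answer warns about.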

Fastest Way to Count Distinct Values in a Column, Including NULL Values

The Transact-SQL Count Distinct operation counts all distinct non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1).
This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.
This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).
What is the most efficient way to do this?
Update After Testing
I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest
Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)
Testing Results (in seconds):
Method              20_Columns  5_Columns  1_Column (Large)  1_Column (Small)
1) Count/GroupBy    10.8        4.8        2.8               0.14
2) CountDistinct    12.4        4.8        3                 0.7
3) dense_rank       226         30         6                 4.33
4) Count+Max        98.5        44         16                12.5
Notes:
Interestingly enough, the two methods that were fastest (by far, with only a small difference between them) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps this is because the gains that would be achieved by limiting the number of table scans are small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.
Thanks for the help and suggestions!
SELECT COUNT(*)
FROM (SELECT ColumnName
FROM TableName
GROUP BY ColumnName) AS s;
GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.
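The difference is easy to see on a tiny table. This is a sketch using Python's sqlite3 module rather than SQL Server, with a made-up table `t` and column `col`; the NULL handling of COUNT and GROUP BY shown here is standard SQL behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (col TEXT);
    INSERT INTO t VALUES ('a'), ('b'), ('a'), (NULL), (NULL);
""")

# COUNT(DISTINCT col) skips NULLs entirely.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT col) FROM t").fetchone()[0]

# GROUP BY puts all NULLs into one group, and COUNT(*) counts that group.
group_by_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT col FROM t GROUP BY col) AS s"
).fetchone()[0]

print(count_distinct, group_by_count)
```

Here the distinct count is 2 ('a' and 'b'), and the group-by count is 3 because the NULL group is included once, however many NULL rows there are.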
I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.
;with C as
(
select dense_rank() over(order by Col1) as dnCol1,
dense_rank() over(order by Col2) as dnCol2
from YourTable
)
select max(dnCol1) as CountCol1,
max(dnCol2) as CountCol2
from C
Test the query at SE-Data
A development on OP's own solution:
SELECT
COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)
Select Count(Distinct COLUMNNAME) +
Case When Exists
(Select * from TABLENAME Where COLUMNNAME is Null)
Then 1 Else 0 End
From TABLENAME
You can try:
count(
distinct coalesce(
your_table.column_1, your_table.column_2
-- cast them if the column and its replacement value are not the same type
)
) as COUNT_TEST
The coalesce function lets you combine two columns, replacing NULLs in the first with the non-null values from the second.
I used this in my case and got the correct result.
Not sure this would be the fastest but might be worth testing. Use case to give null a value. Clearly you would need to select a value for null that would not occur in the real data. According to the query plan this would be a dead heat with the count(*) (group by) solution proposed by Cheran S.
SELECT
COUNT( distinct
(case when [testNull] is null then 'dbNullValue' else [testNull] end)
)
FROM [test].[dbo].[testNullVal]
With this approach you can also count more than one column:
SELECT
COUNT( distinct
(case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
),
COUNT( distinct
(case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
)
FROM [test].[dbo].[testNullVal]
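The caveat about choosing a sentinel that cannot occur in the real data can be sketched as follows, using Python's sqlite3 module instead of SQL Server. The helper name `count_with_sentinel` and the table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def count_with_sentinel(conn, table):
    """COUNT(DISTINCT ...) with NULL mapped to the sentinel 'dbNullValue'."""
    sql = f"""
        SELECT COUNT(DISTINCT
            CASE WHEN testNull IS NULL THEN 'dbNullValue' ELSE testNull END)
        FROM {table}
    """
    return conn.execute(sql).fetchone()[0]

conn.executescript("""
    CREATE TABLE safe (testNull TEXT);
    INSERT INTO safe VALUES ('a'), ('a'), (NULL);

    -- The sentinel string also appears as real data here.
    CREATE TABLE collides (testNull TEXT);
    INSERT INTO collides VALUES ('dbNullValue'), (NULL);
""")

print(count_with_sentinel(conn, "safe"))      # 'a' plus NULL
print(count_with_sentinel(conn, "collides"))  # NULL merges with the real value
```

On the second table the count comes out one too low, because the NULL and the genuine 'dbNullValue' row collapse into a single distinct value.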
