I would like to know how to get all the rows from table1 that have a matching row in table3.
The structure of the tables is:
table1:
k1 k2
table2:
k1 k2 t1 t2 date type
table3:
t1 t2 date status
The conditions are:
k1 and k2 have to match with the corresponding columns in table2.
In table2 I will only check the rows where date='today' and type='a'.
That can return 0, 1 or many rows in table2.
Looking at t1 and t2 from table2, I get the rows that match in table3.
If in table3 date='today' and status='ok', I return the original row from table1, that is, k1 and k2.
How can I do this query (inner joins, exists, whatever), taking into account that the three tables have millions of rows, so it must be as optimal as possible?
I have the query, which is certainly correct, but there are too many conditions for Teradata to come up with the answer. Too many joins, I think.
I would not consider three tables and a few million rows a complex query.
In Teradata you usually don't have to think that much about join/in/exists; all of them will be rewritten to joins internally. But there is a one-to-many-to-one relation here, so you should avoid a plain join, as it would need a final DISTINCT.
Better use IN or EXISTS instead:
SELECT
K1,K2
FROM Table1
WHERE (K1,K2) IN
(
SELECT K1,K2
FROM Table2
WHERE datecol = CURRENT_DATE
AND typecol = 'a'
AND (T1,T2) IN
(
SELECT T1,T2
FROM Table3
WHERE datecol = CURRENT_DATE
AND status = 'ok'
)
)
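If you prefer EXISTS, an equivalent correlated form (a sketch, using the same assumed column names datecol/typecol) would be:
SELECT a.K1, a.K2
FROM Table1 a
WHERE EXISTS
 (
   SELECT *
   FROM Table2 b
   WHERE b.K1 = a.K1
     AND b.K2 = a.K2
     AND b.datecol = CURRENT_DATE
     AND b.typecol = 'a'
     AND EXISTS
      (
        SELECT *
        FROM Table3 c
        WHERE c.T1 = b.T1
          AND c.T2 = b.T2
          AND c.datecol = CURRENT_DATE
          AND c.status = 'ok'
      )
 )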
Regarding the actual plan: if the necessary statistics exist, the optimizer should choose a good plan; check the confidence levels in Explain. You can also run DIAGNOSTIC HELPSTATS ON FOR SESSION; before running Explain to see if there are missing stats.
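For example (a sketch; the suggested statistics show up at the end of the Explain output):
-- run once per session
DIAGNOSTIC HELPSTATS ON FOR SESSION;
EXPLAIN
SELECT K1, K2
FROM Table1
WHERE (K1,K2) IN
 (
   SELECT K1, K2
   FROM Table2
   WHERE datecol = CURRENT_DATE
     AND typecol = 'a'
 );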
Something like the following should work.
SELECT
Table1.*
FROM
Table1
INNER JOIN Table2 ON
Table1.K1 = Table2.K1 AND
Table1.K2 = Table2.K2 AND
Table2.date = CURRENT_DATE AND
Table2.type = 'a'
INNER JOIN Table3 ON
Table2.T1 = Table3.T1 AND
Table2.T2 = Table3.T2 AND
Table3.date = CURRENT_DATE AND
Table3.status = 'ok'
Update:
Speaking more to the optimization part of the question. The execution steps that Teradata will most likely take here are:
In parallel, it will select all records from Table1, the records from Table2 where the date is CURRENT_DATE and the type is 'a', and the records from Table3 where the date is CURRENT_DATE and the status is 'ok'.
It will then join the results from the SELECT of Table2 to the results of the SELECT from table1.
It will then join the results from that to the results from the SELECT of table3.
You can get more information by putting EXPLAIN before your SELECT query. The result returned from the database will be an explanation of how your Teradata server will execute the query, which can be very enlightening when trying to optimize a big, slow query.
Unfortunately the steps above are the best you can hope for. Parallel execution of all three tables with the filters applied, and then a join of the results. With big data, the slowest part of a query is often the join, so filtering before you get to that step is a big plus.
There's more that can be done to optimize, like making sure your indexes are in order and collecting statistics, especially on the fields you filter by. But without admin access to do that, your hands are tied.
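If you do get that access, a minimal sketch in Teradata syntax (using the hedged column names datecol/typecol/status from the first answer, since DATE and TYPE are reserved words):
-- collect stats on the filter columns of both lookup tables
COLLECT STATISTICS COLUMN (datecol) ON Table2;
COLLECT STATISTICS COLUMN (typecol) ON Table2;
COLLECT STATISTICS COLUMN (datecol) ON Table3;
COLLECT STATISTICS COLUMN (status) ON Table3;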
I'm working through this exercise.
On question 4, the goal is to find employees hired after "Jones". I think this problem can be solved without a join like so:
SELECT first_name, last_name, hire_date
FROM employees
WHERE hire_date > (
SELECT hire_date FROM employees WHERE last_name = "Jones"
)
But the answer on the website suggests:
SELECT e.first_name, e.last_name, e.hire_date
FROM employees e
JOIN employees davies
ON (davies.last_name = "Jones")
WHERE davies.hire_date < e.hire_date;
Are these more-or-less the same or is there a reason the second answer should be considered better?
I assume that the column last_name is defined as UNIQUE, so that the subquery in the 1st query returns only 1 row.
If not, then the queries do not return the same results. Although the subquery in the 1st query may return more than 1 row (in other databases the query would not even run), SQLite will just pick the 1st of the returned rows and use its hire_date to compare against all the rows of the table, while the join will use all the rows where last_name = "Jones".
If my assumption is correct then the 2 queries are equivalent, but the 1st one is what I would suggest because it is more readable and I believe it would perform better than the join.
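Not part of either answer, but if last_name turned out not to be unique, a defensive variant of the 1st query would compare against an aggregate, which stays deterministic in every database and matches what the join form computes (hired after at least one Jones), without duplicate rows:
SELECT first_name, last_name, hire_date
FROM employees
WHERE hire_date > (
    -- the earliest Jones; deterministic even with several rows
    SELECT MIN(hire_date) FROM employees WHERE last_name = 'Jones'
);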
If I had to use a join for this requirement (since it is homework) I would choose a more readable form:
SELECT e.first_name, e.last_name, e.hire_date
FROM employees e
JOIN (SELECT * FROM employees WHERE last_name = "Jones") t
ON t.hire_date < e.hire_date;
I have two tables: Config and Data. The Config table has info defining what I call "Predefined Points"; its columns are configId, machineId, iotype, ioid, subfield and predeftype. The second table contains all the data for all the items in the config table, linked by configId. The Data table's columns are configId, timestamp, value.
I am trying to return each row from the config table with 2 new columns in the result: the min timestamp and the max timestamp of that particular predefined point.
Pseudocode would be
select a.*, min(b.timestamp), max(b.timestamp)
from TrendConfig a
join TrendData b on a.configId = b.configId
where configId = (select configId from TrendConfig)
Where the subquery would return multiple values.
Any idea how to formulate this?
Try an inner join:
select a.*, min(b.timestamp), max(b.timestamp)
from config a
inner join data b
on a.configId = b.configId
group by a.configId
I was able to find an answer using: Why can't you mix Aggregate values and Non-Aggregate values in a single SELECT?
The solution was indeed GROUP BY as CL mentioned above.
select a.*, min(b.timestamp), max(b.timestamp)
from TrendConfig a
join TrendData b on a.configId = b.configId
group by a.configId
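A variant sketch, assuming configId is the primary key of TrendConfig (so selecting a.* alongside this GROUP BY is valid, as in MySQL 5.7+ or SQLite): a LEFT JOIN keeps config rows that have no data rows yet, with NULL min/max:
select a.*, min(b.timestamp) as min_ts, max(b.timestamp) as max_ts
from TrendConfig a
left join TrendData b on a.configId = b.configId
group by a.configId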
Can someone please help me solve my problem?
I have three tables to be joined using indexes in Teradata to improve performance. The query is specified below:
Select b.Id, b.First_name, b.Last_name, c.Id,
c.First_name, c.Last_name, c.Result
from
(
select a.Id, a.First_name, a.Last_name, a.Approver1, a.Approver2
from table1 a
inner join table2 d
on a.Id = d.Id
and a.Approver1 = d.Approver1
and a.Approver2 = d.Approver2
) b
left join
(
select * from table3
where result is not null
and application like 'application1'
) c
on c.Id = b.Id
group by b.Id, b.First_name, b.Last_name, c.Id,
c.First_name, c.Last_name, c.Result
The above query is taking a lot of time, since the PI is not defined correctly.
The first two tables (table1 and table2) have the same set of columns, hence a PI can be defined like PI on Id, Approver1, Approver2.
However, while joining with table3 I am confused and need to understand how to define the PI. Can a PI only work when the tables have the same set of columns?
The structure of table3 is:
Id, First_name, Last_name, Result
And of table1 and table2:
Id, First_name, Last_name, Approver1, Approver2, Result
Can you please help in defining primary indexes so that the query can be optimised?
Teradata will usually not use Secondary Indexes for joins. The best PI would be Id for all three tables; of course you need to check that there are not too many rows per value and that the data is not too skewed.
The GROUP BY could be simplified to a DISTINCT. Why do you need it at all? Can you show the Primary Keys of those tables?
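A sketch of both checks and the change in Teradata syntax (table1_new is a hypothetical target name; repeat for the other tables):
-- rows per value: should stay reasonably small for PI (Id)
SELECT TOP 10 Id, COUNT(*) AS rows_per_value
FROM table1
GROUP BY Id
ORDER BY rows_per_value DESC;
-- skew check: rows per AMP under PI (Id)
SELECT HASHAMP(HASHBUCKET(HASHROW(Id))) AS amp_no, COUNT(*) AS row_cnt
FROM table1
GROUP BY 1
ORDER BY 2 DESC;
-- rebuild the table with the new PI
CREATE TABLE table1_new AS (SELECT * FROM table1) WITH DATA PRIMARY INDEX (Id);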
Edit based on comment:
PI-based joins are by far the fastest way. But you should be able to get rid of the DISTINCT, too; it's always a huge overhead.
Try replacing the 1st join with an EXISTS:
Select b.Id, b.First_name, b.Last_name, c.Id,
c.First_name, c.Last_name, c.Result
from
(
select a.Id, a.First_name, a.Last_name, a.Approver1, a.Approver2
from table1 a
where exists
(
select *
from table2 d
where a.Id = d.Id
and a.Approver1 = d.Approver1
and a.Approver2 = d.Approver2
)
) b
left join
(
select * from table3
where result is not null
and application like 'application1'
) c
on c.Id = b.Id
I have the following query:
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1, A2,A3,A4,A5,A6,COUNT(A7) AS total,A8
FROM (
select a.* from table1 a
left join (select * from table_reject where name = 'smith') b on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
GROUP BY
CUBE(A1, A2,A3,A4,A5,A6,A8)
)INN
WHERE
INN.A1 IS NOT NULL AND
INN.A2 IS NULL AND
INN.A3 IS NULL AND
INN.A4 IS NULL AND
INN.A5 IS NULL AND
INN.A6 is NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
The total number of records in table1 is around 8 million.
My problem is that I need to optimize the above query in the best possible way. I did try creating an index on column A8 of table1; the index decreased the cost of the query, but the execution time is more or less the same as when there was no index on table1.
Any help would be appreciated.
Thanks
A CUBE operation on a large data set is really expensive, so you need to check whether you really need all that data in the inner query. I see you are doing a COUNT in the inner query and then a SUM of those counts in the outer query; in other words: get the row count of A7 for every combination of A1-A8 (minus A7), then get the SUM only for the combinations selected by the WHERE clause. We can certainly optimize this by limiting the CUBE to certain columns, but the most obvious things I have noticed so far are as follows.
If you use the query below and have the right indexes on Table1 and Table_reject, then both queries can utilize the indexes and reduce the data set that needs to be joined and processed further.
I am not 100% sure, but partial CUBE processing appears to be possible, and that is worth checking.
clustered index --> Table1 needs one on A8, and Table_Reject needs one on NAME.
non-clustered index --> Table1 needs one on A3, A9, and Table_Reject needs one on B3, B2 (written out as DDL after the query below).
SELECT qry1.*
FROM
(
SELECT A1, A2,A3,A4,A5,A6,A7,A8
FROM table1
WHERE A8 >= NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
)qry1
LEFT JOIN
(
select B3,B2,ID
from table_reject
where name = 'smith'
)qry2
ON qry1.A3 = qry2.B3 and qry1.A9=qry2.B2
WHERE qry2.ID IS NULL
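The index recommendations above, written out as DDL (SQL Server syntax, to match the clustered/non-clustered terminology; the index names are made up):
-- Table1: clustered on the date filter, non-clustered on the join keys
CREATE CLUSTERED INDEX IX_Table1_A8 ON Table1 (A8);
CREATE NONCLUSTERED INDEX IX_Table1_A3_A9 ON Table1 (A3, A9);
-- Table_Reject: clustered on the name filter, non-clustered on the join keys
CREATE CLUSTERED INDEX IX_TableReject_Name ON Table_Reject (name);
CREATE NONCLUSTERED INDEX IX_TableReject_B3_B2 ON Table_Reject (B3, B2);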
EDIT1:
I tried to find out what the difference in the CUBE operator's result would be if you apply it to all columns versus only the columns you need in the result set. What I found is that, the way the CUBE function works, you do not need to perform CUBE on all columns, because in the end you only care about the combinations generated by CUBE where A1 and A8 are NOT NULL.
Try this link and see the output.
Query1 and Query2 are just the innermost queries, to compare the CUBE result sets.
Query3 and Query4 are your actual query, and you can see the results are the same in both cases.
DECLARE #NEXT_DAY DATE = NEXT_DAY ( trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')) ,'SUN' )
SELECT distinct A1 ,sum(total) as sum_total FROM
(
SELECT A1,COUNT(A7) AS total,A8
FROM (
select a.a1,a.a7,a.a8
from table1 a
left join (select * from table_reject where name = 'smith') b
on A.A3 = B.B3 and A.A9 =B.B2
where B.ID is null
) t1
WHERE A8 >= #NEXT_DAY
GROUP BY
CUBE(A1,A8)
)INN
WHERE INN.A1 IS NOT NULL AND
INN.A8 IS NOT NULL
GROUP BY A1
ORDER BY sum_total DESC ;
EDIT3
As I mentioned in the comment, this is a Round3 update. I cannot edit the comment, but I meant Edit3 instead of Round3.
The new change in your query is adding the WHERE A8 >= #NEXT_DAY condition to the innermost left-join select, alongside B.ID IS NULL. That has improved the selection very much.
In your last comment you mentioned that the query takes 30-35 seconds and that the time keeps increasing as you change the value of A8. You gave the execution time, but not how much data is in the result set. Why is that important? Because if a query returns 5M rows as its final result set, it will spend 90% of its time just dropping that data onto the UI or into an output file, whatever output method you are using. Actual performance should be measured by how soon the query starts returning the first couple of rows, because by that time the optimizer has already decided on the execution plan and the DB is executing it. That said, I agree that if a query returns 100 rows and takes 10 seconds, something may be wrong with the execution plan.
To demonstrate, I created dummy data and performed your query against it.
I have a table Test_CubeData with 9M rows in it, with the same number of columns and data types you described for your Table1. I have a second table Table_Reject with 80K rows, with the number of columns and data types I figured out from the query. To test the extreme side, the name column holds only the single value "smith", and ID is null for all 80K rows; so the only column values that can affect the inner left join's result are B2 and B3.
In these tests I do not have any index on either table; both are heaps. You can see the results still come back in a few seconds for an acceptable amount of data in the result set, and the completion time increases as my result data set grows. If I create the indexes described above, they will give me an Index Seek operation for all these tested cases, but at a certain point the index will be exhausted and turn into an Index Scan.
One sure example: if my filter value for the A8 column is the smallest date value in that column, the optimizer will see that all 9M rows need to participate in the inner select and the CUBE, and a lot of data will be processed in memory, which is expected. On the other hand, consider another set of example queries: I have 32873 unique values in the A8 column, and they are almost equally distributed among the 9M rows, so there are 260 to 300 rows per single A8 value. If I execute the query for any single value - the smallest, the largest, or anything in between - the query execution time should not change.
Notice the highlighted text in each image below, which indicates: the chosen value of the A8 filter, only the important columns in the select list instead of *, the A8 filter added to the inner left-join query, the execution plan showing the TableScan operation on both tables, the query execution time in seconds, and the total number of rows returned by the query.
I hope this clears up some doubts about the performance of your query and helps you set the right expectations.
(Screenshots: Table Row Counts; TableScan_InnerLeftJoin; TableScan_FullQuery_248Rows; TableScan_FullQuery_5K; TableScan_FullQuery_56K; TableScan_FullQuery_480k)
You're calculating a potentially very large cube result on seven columns, and then discarding all the results except those that are logically just a GROUP BY on column A1.
I suggest that you rewrite the query to just group by A1.
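A sketch of that rewrite, assuming the outer filter really keeps only the (A1, A8) grouping combinations, so that summing those counts per A1 collapses to a plain GROUP BY A1:
SELECT A1, COUNT(A7) AS sum_total
FROM (
       select a.*
       from table1 a
       left join (select * from table_reject where name = 'smith') b
         on a.A3 = b.B3 and a.A9 = b.B2
       where b.ID is null
     ) t1
WHERE A8 >= NEXT_DAY(trunc(to_date('17/09/2013 12:00:00','dd/mm/yyyy hh24:mi:ss')), 'SUN')
  AND A1 IS NOT NULL -- mirrors INN.A1 IS NOT NULL from the original
GROUP BY A1
ORDER BY sum_total DESC;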
I am trying to update Table B of a database looking like this:
Table A:
id, amount, date, b_id
1,200,6/31/2012,1
2,300,6/31/2012,1
3,400,6/29/2012,2
4,200,6/31/2012,1
5,200,6/31/2012,2
6,200,6/31/2012,1
7,200,6/31/2012,2
8,200,6/31/2012,2
Table B:
id, b_amount, b_date
1,0,0
2,0,0
3,0,0
Now with this query I get all the data I need in one select:
SELECT A.*,B.* FROM A LEFT JOIN B ON B.id=A.b_id WHERE A.b_id>0 GROUP BY B.id
id, amount, date, b_id, id, b_amount, b_date
1,200,6/31/2012,1,1,0,0
3,400,6/29/2012,1,1,0,0
Now, I just want to copy the selected column amount to b_amount and date to b_date
b_amount=amount, b_date=date
resulting in
id, amount, date, b_id, id, b_amount, b_date
1,200,6/31/2012,1,1,200,6/31/2012
3,400,6/29/2012,1,1,400,6/29/2012
I've tried COALESCE() without success.
Does someone experienced have a solution for this?
Solution:
Thanks to the answers below, I managed to come up with this. It is probably not the most efficient way, but it is fine for a one-time-only update. This will insert the first corresponding entry of each group for you.
REPLACE INTO A SELECT id, amount, date FROM
(SELECT A.id, A.amount, A.date, B.id as Bid FROM A INNER JOIN B ON (B.id=A.B_id)
ORDER BY A.id DESC)
GROUP BY Bid;
So what you are looking for seems to be a JOIN inside of an UPDATE query. In MySQL you would use
UPDATE B INNER JOIN A ON B.id=A.b_id SET B.amount=A.amount, B.date=A.date;
but this is not supported by SQLite, as this probably related question points out. However, there is a workaround using REPLACE:
REPLACE INTO B
SELECT B.id, A.amount, A.date FROM A
LEFT JOIN B ON B.id=A.b_id
WHERE A.b_id>0 GROUP BY B.id;
The query simply fills in the values of table B for all columns which should keep their state, and fills in the values from table A for the copied fields. Make sure the order of the columns in the SELECT statement matches the column order of table B and that all columns are mentioned, or you will lose those fields' data. This is somewhat dangerous with respect to future changes on table B, so keep in mind to update the column order/presence in this query whenever you change table B.
Something a bit off topic, because you did not ask for it: A.b_id is obviously a foreign key to B.id. It seems you are using the value 0 in that foreign key to express that there is no corresponding entry in B (inferred from your SELECT with WHERE A.b_id>0). You should consider using the NULL value for that. If you then use an INNER JOIN instead of the LEFT JOIN, you can drop the WHERE clause entirely; the DBS will sort out all unsatisfied relations, as sketched below.
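For illustration, the same REPLACE with an INNER JOIN and without the WHERE filter (unmatched rows simply drop out of the join):
REPLACE INTO B
SELECT B.id, A.amount, A.date FROM A
INNER JOIN B ON B.id=A.b_id
GROUP BY B.id;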
WARNING: Some RDBMS will return 2 rows as shown above. Others will return the Cartesian product of the rows, i.e. A rows times B rows.
One tricky method is to generate SQL that is then executed
SELECT "update B set b.b_amount = ", a.amount, ", b.b_date = ", a.date,
" where b.id = ", a.b_id
FROM A LEFT JOIN B ON B.id=A.b_id WHERE A.b_id>0 GROUP BY B.id
Now add the batch terminator and execute this SQL. The query result should look like this
update B set b_amount = 200, b_date = 6/31/2012 where id = 1
update B set b_amount = 400, b_date = 6/29/2012 where id = 3
NOTE: Some RDBMS will handle dates differently. Some require quotes.