How to identify duplicates in a 5.5 billion record Teradata table - teradatasql

I have 5.5 billion records in a Teradata table. The table has 114 columns. What's the best way to find exact duplicate rows? I need to insert the duplicate records into another table for further processing.
Thanks
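One common approach is to group on every column and keep only the groups that occur more than once. The sketch below uses placeholder names (big_tbl for the source, dup_tbl for the target, col1 .. col114 for the columns), since the real DDL is not given; in practice you would list all 114 columns explicitly in both the SELECT and the GROUP BY.

-- Rough sketch with placeholder names; dup_tbl is assumed to have the same
-- columns as big_tbl plus a dup_cnt column for the number of occurrences.
INSERT INTO dup_tbl
SELECT col1, col2, /* ... remaining columns ... */ col114,
       COUNT(*) AS dup_cnt
FROM big_tbl
GROUP BY col1, col2, /* ... remaining columns ... */ col114
HAVING COUNT(*) > 1;

Note that on a 5.5 billion row table this is a full aggregation across all AMPs, so expect significant spool usage; also, this only applies to a MULTISET table, since a SET table cannot hold exact duplicate rows in the first place.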

Related

Updating huge tables in Teradata

I am trying to update a huge table in Teradata on a daily basis. The UPDATE statement is taking a lot of AMPCPUTime.
The table contains 65 billion rows, and 100-200 million rows are updated each day.
The table is a SET table with a non-unique PI. The data distribution is quite even, with a skew factor of 0.8.
What is the way to reduce the AMPCPUTime?
The update is done using a stage table; the join is on a subset of the PI columns.
Attempts: changed the PI of the stage table to match the target table. The EXPLAIN plan says a merge update is being performed, but AMPCPUTime is actually increasing.
Also tried DELETE and INSERT, but they too consume more AMPCPUTime.
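For reference, a minimal sketch of the kind of co-located update being described, assuming a target table tgt and a stage table stg that share the PI columns pi_col1 and pi_col2 plus a value column measure_col (all placeholder names):

-- Sketch only; tgt, stg, pi_col1/pi_col2 and measure_col are placeholder names.
-- Teradata's MERGE requires the ON clause to include equality on the target's full PI,
-- which is what allows the AMP-local merge update shown in the EXPLAIN plan.
MERGE INTO tgt
USING stg
ON (tgt.pi_col1 = stg.pi_col1 AND tgt.pi_col2 = stg.pi_col2)
WHEN MATCHED THEN UPDATE
  SET measure_col = stg.measure_col;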

How long does it take to retrieve all rows from a SQLite table?

If I have one million rows in a SQLite table and I want to retrieve all of them, how long will this take? Will it cost one million times as much as retrieving only one row? If not, what is the correct way to retrieve all rows?

SET table vs MULTISET table performance

I have to prepare a table in which I will keep weekly results for some aggregated data. The table will have 30 fields (10 CHARACTERs, 20 DECIMALs), and I think I will have 250k rows weekly.
In my head I can see two scenarios:
A SET table, relying on Teradata to prevent duplicate rows - it should skip duplicate entries while inserting new data.
A MULTISET table with a UPI - it will give an error upon inserting a duplicate row.
The INSERT statement is going to be executed through VBA in Excel, where handling possible Teradata errors is not a problem.
Which scenario will be faster to run in a year's time, when there will be circa 14 million rows?
Is there any other way to have it done?
Regards
At a high level, since you would have a comparatively high row count in your table, it is advisable not to use a SET table; rather, go with a MULTISET table.
For more info you can refer to this link:
http://www.dwhpro.com/teradata-multiset-tables/
Why do you care about duplicate rows? When you store weekly aggregates there should be no duplicates at all. And duplicate rows are not the same as duplicate primary key values.
Simply choose a PI that best fits your join/access pattern (maybe partitioned by date). To avoid any potential duplicates you might simply use MERGE instead of INSERT, as sketched below.
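A rough sketch of the MERGE approach, assuming placeholder names: a weekly_agg table whose UPI is (acct_id, week_start) and a single measure column measure1 standing in for the 30 real fields; the literals are example values you would bind from VBA:

-- Placeholder names and example literals only; adapt to the real table.
MERGE INTO weekly_agg AS tgt
USING (SELECT 12345 AS acct_id,
              DATE '2024-01-01' AS week_start,
              99.50 AS measure1) AS src
ON (tgt.acct_id = src.acct_id AND tgt.week_start = src.week_start)
WHEN MATCHED THEN UPDATE
  SET measure1 = src.measure1
WHEN NOT MATCHED THEN INSERT
  (acct_id, week_start, measure1)
  VALUES (src.acct_id, src.week_start, src.measure1);

Because the match is on the UPI, a row arriving a second time simply updates the existing row instead of raising an error or being silently skipped.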

What does the number in a SQLite query plan mean?

I output the query plan in SQLite, and it shows
0|0|0|SCAN TABLE t (~500000 rows)
I wonder what the meaning of the number (500000) is. I guessed it is the table length, but I executed the query on a small table which does not have that many rows.
Is there any official documentation about the meaning of the number? Thanks.
As the official documentation says, this is the number of rows that the database estimates will be returned.
If there is an index on a searched column, and if you have run ANALYZE, then SQLite can make an estimate based on the actual data. Otherwise, it assumes that tables contain one million rows, and that a search like column > x filters out half the rows.
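For illustration, a small sketch using placeholder names (table t, searched column c); after ANALYZE, the row estimate in the plan is based on the gathered statistics rather than on the defaults:

-- t and c are placeholder names for the table and the searched column.
CREATE INDEX IF NOT EXISTS idx_t_c ON t(c);
ANALYZE;                                      -- collect statistics for the planner
EXPLAIN QUERY PLAN SELECT * FROM t WHERE c > 10;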

BULK COLLECT and FOR loop when all the values for the INSERT DML are not available

I want to insert 150K records from a source table into a destination table. The problem is that I have to calculate some values for the destination table too.
How should I use BULK COLLECT and a FOR statement for the INSERT DML?
Please find the detailed explanation below.
Source Table
Account_id | Status1 | Status2
Table 1
Account_id | column2 | column3 | column4 | column6 | column7
Table 2
Account_id | column2 | column3 | column6 | column9 | column10
Now I have to fetch the values from Table 1 for the account_ids matching the source table and insert them into Table 2, where I have to populate column9 and column10 dynamically.
BULK COLLECT requires a lot of memory and in practice is only feasible if you process your data in chunks, i.e. about 1000 rows at a time (see the sketch after this answer). Otherwise the memory consumption will be too much for most systems.
The best option is usually to create a single INSERT .. SELECT statement that retrieves, calculates and inserts all data at once.
If this is not possible or far too complex, the second best option in my opinion is a pipelined function written in PL/SQL.
The third best and usually easiest option is a simple PL/SQL loop that selects row by row, calculates the required data and inserts it row by row. Performance-wise it's usually the worst, but it can still be more than sufficient.
For more precise answers, you need to specify the exact problem at hand. Your question is rather broad.
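As an illustration of the chunked approach, here is a rough sketch assuming placeholder names (source_table, table1, table2 for the three tables in the question) and placeholder expressions standing in for whatever logic actually derives column9 and column10:

DECLARE
  CURSOR c IS
    SELECT t1.account_id, t1.column2, t1.column3, t1.column6
    FROM   table1 t1
    JOIN   source_table s ON s.account_id = t1.account_id;

  TYPE t_in  IS TABLE OF c%ROWTYPE;
  TYPE t_out IS TABLE OF table2%ROWTYPE;

  l_in  t_in;
  l_out t_out := t_out();
BEGIN
  OPEN c;
  LOOP
    FETCH c BULK COLLECT INTO l_in LIMIT 1000;     -- chunked fetch keeps memory bounded
    EXIT WHEN l_in.COUNT = 0;

    l_out.DELETE;
    l_out.EXTEND(l_in.COUNT);
    FOR i IN 1 .. l_in.COUNT LOOP                  -- calculate the derived columns
      l_out(i).account_id := l_in(i).account_id;
      l_out(i).column2    := l_in(i).column2;
      l_out(i).column3    := l_in(i).column3;
      l_out(i).column6    := l_in(i).column6;
      l_out(i).column9    := l_in(i).column2 * 2;  -- placeholder calculation
      l_out(i).column10   := SYSDATE;              -- placeholder calculation
    END LOOP;

    FORALL i IN 1 .. l_out.COUNT                   -- bulk insert the prepared rows
      INSERT INTO table2 VALUES l_out(i);
  END LOOP;
  CLOSE c;
  COMMIT;
END;
/

If the calculations for column9 and column10 can be expressed in SQL, the single INSERT INTO table2 SELECT ... mentioned above is usually the better choice.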
