Updating huge tables in Teradata - teradata

I am trying to update a huge table in Teradata on a daily basis. The UPDATE statement is consuming a lot of AMPCPUTime.
The table contains 65 billion rows, of which 100-200 million are updated.
The table is a SET table with a non-unique PI. The data distribution is quite even, with a skew factor of 0.8.
How can I reduce the AMPCPUTime?
The update is done using a stage table. The join is on a subset of the PI columns.
Attempts: I changed the PI of the stage table to match the target table. The Explain plan says a merge update is being performed, but AMPCPUTime actually increased.
I also tried DELETE and INSERT, but that takes even more AMPCPUTime.

Related

Azure Analysis Service - partition to refresh modified rows only?

I have an AS tabular model that contains a fact table with 20 mil rows. I have partitioned this so only the new rows get added to each day... however occasionally, a historical row (from years ago) will be modified. I can identify this modified row in SQL (using the last modified timestamp) however would it be possible for me to refresh the row in SSAS to reflect this change without having to refresh my entire data model? How would I achieve this?
First, 20 million rows is not a lot. I’m expecting that will only take 5-10 minutes to process unless your SQL queries are very inefficient or very wide. So why bother to optimize something which may be fast enough already?
If you do need to optimize it, you will first want to partition the large fact table by some date element. Since you only have 20 million rows I would suggest partitioning by year. Optimal compression will be achieved with around 8 million rows per partition. Over-partitioning (such as creating thousands of daily partitions) is counter-productive.
When a new row is added you could perform a ProcessAdd to insert just the new records to the partitions in question. However I would recommend just doing a ProcessFull on any year partitions which have any inserts, updates or deletes in SQL.
SSAS doesn’t support updating a specific row. Thus you have to follow the ProcessFull advice above.
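As a rough illustration of the advice above, the sketch below works out which year partitions would need a ProcessFull, given rows pulled from SQL with their fact date and last-modified timestamp. The function name, the tuple layout, and the sample data are hypothetical; it does not call any SSAS API and only shows the partition-selection step.

```python
from datetime import date, datetime

def year_partitions_to_refresh(rows, last_refresh):
    """Return the sorted years whose partitions hold rows changed since last_refresh.

    rows: iterable of (fact_date, last_modified) pairs, e.g. the result of a SQL query.
    """
    return sorted({fact_date.year
                   for fact_date, last_modified in rows
                   if last_modified > last_refresh})

# Hypothetical sample data: (fact date, last modified timestamp)
changed = [
    (date(2015, 3, 14), datetime(2024, 1, 10, 8, 30)),   # historical row edited recently
    (date(2023, 12, 31), datetime(2024, 1, 10, 9, 0)),   # recent row edited recently
    (date(2019, 6, 1), datetime(2020, 1, 1, 0, 0)),      # unchanged since the last refresh
]

partitions = year_partitions_to_refresh(changed, datetime(2024, 1, 9))
print(partitions)  # only these year partitions would be fully reprocessed
```

Each year in the result would then be passed to whatever processing script or tool you already use for the model.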
There are several code examples including this one which may help you.
Again this may be overkill if you only have 20 million rows.

DynamoDB table creation is too slow

I am doing some scalability testing for one of my programs. As part of this, I have to insert millions of records into a DynamoDB table, take measurements, and then rerun the test with different parameters. I need to start with an empty table for each run. Deleting each record from the table takes too much time, so I am deleting the table and recreating it. The table has a GSI, which I create before each run. After I set the auto-scaling properties on the table, it takes up to 1 hour for the table to be ready. What could be going wrong?

How best to efficiently extract data from a large SQLite database?

I am using SQLite to store a large amount of data and am having trouble extracting that data using very simple queries. At the moment, my database is just one table, with about 50 million rows and 15 columns. I would like to extract one complete column from this table.
I have tried using RSQLite: dbGetQuery(db, 'select qs from CSI'), where qs and CSI are my column and table names respectively. The qs values are character strings. This query runs for hours before I give up (R version 3.3.3, RSQLite_1.1-2).
I also tried the DB Browser for SQLite (v3.9.1) with the same query and again gave up after a few hours of run time. I do not have an ID key or any index, but I thought that since I want the entire column, this should not have any impact.
I am running on a 64bit Windows machine with 16GB Ram. How can I extract columns from my table within a reasonable time? Or is there a better way I should be storing my data for easy access?
To get a column value, SQLite has to read the row up to the column. So to get the values from all rows, it has to read practically everything.
With an index on this column, you would have a covering index that would reduce the amount of data to be read from disk.
If you do not actually need multiple values from the same row, consider storing the columns in different tables, or using a different database.
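A minimal sketch of the covering-index idea, using Python's built-in sqlite3 module rather than RSQLite. The table and column names CSI and qs come from the question; the extra columns, the sample data, and the index name idx_csi_qs are made up for illustration. It compares the query plan before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the real table: qs plus a few wide columns.
conn.execute(
    "CREATE TABLE CSI (id INTEGER PRIMARY KEY, qs TEXT, a TEXT, b TEXT, c REAL, d REAL)"
)
conn.executemany(
    "INSERT INTO CSI (qs, a, b, c, d) VALUES (?, ?, ?, ?, ?)",
    [(f"q{i:04d}", "x" * 50, "y" * 50, i * 1.0, i * 2.0) for i in range(1000)],
)

def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT qs FROM CSI"

plan_before = plan(query)                       # full scan of the table
conn.execute("CREATE INDEX idx_csi_qs ON CSI (qs)")
plan_after = plan(query)                        # scan of the narrower index

values = [row[0] for row in conn.execute(query)]

print(plan_before)
print(plan_after)
print(len(values))
```

Note that when the index is used, the values come back in index order rather than insertion order, and that building the index on a 50 million row table is itself a one-time full read of the data.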

Set table vs multi set table performance

I have to prepare a table where I will keep weekly results for some aggregated data. The table will have 30 fields (10 CHARACTERs, 20 DECIMALs), and I expect about 250k rows weekly.
In my head I can see two scenarios:
A SET table, relying on Teradata to prevent duplicate rows: it should skip duplicate entries while inserting new data.
A MULTISET table with a UPI: it will give an error upon inserting a duplicate row.
The INSERT statement is going to be executed through VBA in Excel, where handling possible Teradata errors is not a problem.
Which scenario will be faster to run in a year's time, when there will be circa 14 million rows?
Is there any other way to have it done?
Regards
On a high level, since you would be having a comparatively high data count on your table, it is advisable not to use SET tables, rather go with the multiset table.
For more info you can refer to this link
http://www.dwhpro.com/teradata-multiset-tables/
Why do you care about duplicate rows? When you store weekly aggregates there should be no duplicates at all. And duplicate rows are not the same as duplicate Primary Key values.
Simply choose a PI which best fits your join/access pattern (maybe partition by date). To avoid any potential duplicates you might simply use MERGE instead of INSERT.

What's the number in SQLite query plan mean?

I output the query plan on SQLite, and it shows
0|0|0|SCAN TABLE t (~500000 rows)
What does the number (500000) mean? I guessed it was the number of rows in the table, but I executed the query on a small table which does not have that many rows.
Is there any official documentation about the meaning of this number? Thanks.
As the official documentation says, this is the number of rows that the database estimates will be returned.
If there is an index on a searched column, and if you have run ANALYZE, then SQLite can make an estimate based on the actual data. Otherwise, it assumes that tables contain one million rows, and that a search like column > x filters out half the rows.
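A small sketch of where those statistics come from, using Python's built-in sqlite3 module. The table name t is from the question; the column x, the index name idx_t_x, and the sample data are assumptions. Be aware that SQLite 3.8.0 and later no longer print the "(~N rows)" estimate in EXPLAIN QUERY PLAN output, so on a current build the example inspects the sqlite_stat1 table that ANALYZE fills in instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("CREATE INDEX idx_t_x ON t (x)")   # hypothetical index on the searched column
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i % 10,) for i in range(1000)])

# The plan itself; the last column of each row is the human-readable detail.
plan_rows = list(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE x > 5"))
for row in plan_rows:
    print(row[-1])

# ANALYZE gathers statistics from the actual data and stores them in sqlite_stat1.
conn.execute("ANALYZE")
stats = list(conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1"))
for tbl, idx, stat in stats:
    # The first integer in stat is the number of rows in the index.
    print(tbl, idx, stat)
```

Once sqlite_stat1 is populated, the planner bases its estimates on these figures instead of its built-in defaults.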
