Trying to run an initial bulk load into a MULTISET table, split into 10 INSERT SQLs based on a MOD on the identifier column. The first and second inserts run fine, but the third is failing due to high CPU skew.
DBQLOGTBL shows the first SQL took about 10% CPU, the second took 30%, and the third was at 50% CPU when it failed.
The number of records being loaded in each split is roughly the same. According to the Explain plan, the step that is failing is where Teradata does a MERGE into the main table using a spool.
What could be done to solve the problem?
Table is MULTISET with a NUPI
Partitioned on a date column
Post initial load the data volume will be 6 TB, so roughly 600 GB is being inserted in each of the 10 splits
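For reference, a minimal sketch of what one of the ten split inserts presumably looks like; the table and column names (TGT_TABLE, STG_TABLE, id, txn_date, amount) are illustrative and not from the original post:

-- Split number 2 of 10: only rows whose identifier falls in this bucket
-- (all object names here are assumed)
INSERT INTO TGT_TABLE (id, txn_date, amount)
SELECT id, txn_date, amount
FROM   STG_TABLE
WHERE  id MOD 10 = 2;   -- the other nine SQLs use the remaining values 0-9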
I have an AS tabular model that contains a fact table with 20 million rows. I have partitioned this so only the new rows get added each day; however, occasionally a historical row (from years ago) will be modified. I can identify the modified row in SQL (using the last-modified timestamp), but would it be possible to refresh just that row in SSAS to reflect the change without having to refresh my entire data model? How would I achieve this?
First, 20 million rows is not a lot. I'd expect that to take only 5-10 minutes to process unless your SQL queries are very inefficient or very wide. So why bother optimizing something which may be fast enough already?
If you do need to optimize it, you will first want to partition the large fact table by some date element. Since you only have 20 million rows I would suggest partitioning by year. Optimal compression will be achieved with around 8 million rows per partition. Over-partitioning (such as creating thousands of daily partitions) is counter-productive.
When a new row is added you could perform a ProcessAdd to insert just the new records into the partitions in question. However, I would recommend simply doing a ProcessFull on any year partitions which have had inserts, updates or deletes in SQL.
SSAS doesn’t support updating a specific row. Thus you have to follow the ProcessFull advice above.
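If you track changes with a last-modified timestamp as described, a minimal sketch of the SQL you could use to work out which year partitions need a ProcessFull might look like this (dbo.FactSales, TransactionDate and LastModified are assumed names, not from the original question):

-- Years whose partitions contain rows changed since the last successful process
DECLARE @LastProcessedTime datetime2 = '2024-01-01';  -- assumed: your last process time

SELECT DISTINCT YEAR(TransactionDate) AS PartitionYear
FROM   dbo.FactSales
WHERE  LastModified > @LastProcessedTime;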
There are several code examples, including this one, which may help you.
Again this may be overkill if you only have 20 million rows.
I am trying to update a huge table in Teradata on a daily basis. The UPDATE statement is taking a lot of AMPCPUTime.
The table contains 65 billion rows and 100-200 million rows are updated each day.
The table is a SET table with a non-unique PI. The data distribution is quite even, with a skew factor of 0.8.
What is the way to reduce the AMPCPUTime?
The update is done using a stage table. The join is on a subset of the PI columns.
Attempts: changed the PI of the stage table to match the target table. The Explain plan says a Merge Update is being performed, but the AMPCPUTime is actually increasing.
Also tried Delete and Insert, but that is taking even more AMPCPUTime.
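For context, a minimal sketch of the kind of staged update being described, assuming a target table BIG_TBL with PI (cust_id, acct_id) and a stage table STG_TBL given the same PI; all names are illustrative, and in the original post the join is only on a subset of the PI columns:

-- Joined (merge-style) update driven from the stage table
UPDATE t
FROM BIG_TBL AS t, STG_TBL AS s
SET balance = s.balance          -- balance is an assumed column
WHERE t.cust_id = s.cust_id
  AND t.acct_id = s.acct_id;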
I am doing some scalability testing for one of my programs. As part of this I have to insert millions of records into a DynamoDB table, take measurements, and then rerun the test with different parameters. I need to start with an empty table for each run. Deleting each record from the table takes too much time, so I am deleting the table and recreating it instead. The table has a GSI which I create before each run, and after I have set auto-scaling properties on the table it takes up to 1 hour for the table to be ready. What could be going wrong?
We are planning to archive older data from some tables, but before doing so we have to estimate how much space we will gain once we purge the older records.
For example, suppose we have an ORDERS table which is consuming 5 GB of space on disk. We have more than 15 million records in this table and we are interested in keeping only the records from after 2010. When we query for records before 2010 we get approximately 12 million records, and we are planning to archive and purge these.
We first have to calculate how much free space we will gain when we remove these 12 million records. How can we calculate the space consumed by such a selected set of records?
One way I can think of is to create a new table for these 12 million old records and then calculate its segment size.
Please suggest whether we can calculate the space of the selected records in a better way. Thanks.
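For what it's worth, the "copy to a new table and measure the segment" idea mentioned above could look roughly like this (a sketch only; the order_date column name and the date cutoff are assumptions):

-- Copy the pre-2010 rows into a scratch table
CREATE TABLE orders_pre2010 AS
  SELECT * FROM orders WHERE order_date < DATE '2010-01-01';

-- See how much space that segment actually allocated
SELECT bytes / 1024 / 1024 AS size_mb
FROM   user_segments
WHERE  segment_name = 'ORDERS_PRE2010';

-- Clean up afterwards
DROP TABLE orders_pre2010 PURGE;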
To calculate the space of the selected records, you can try the following:
Step 1:
scott@dev8i> analyze table orders compute statistics;
Table analyzed.
Or, as per Ben's suggestion:
scott@dev8i> EXEC DBMS_STATS.gather_table_stats('<SCHEMA>', 'ORDERS');
Step 2:
scott@dev8i> select num_rows * avg_row_len
from dba_tables
where table_name = 'ORDERS';
NUM_ROWS*AVG_ROW_LEN
--------------------
560   -- this is the approximate total table size in bytes
The result of the query shows that the ORDERS table is using roughly 560 bytes of the total space allocated to it (num_rows * avg_row_len estimates the space occupied by the row data, not the allocated segment size).
Since you want to know how much space the 12 million records occupy, you just need to replace num_rows with 12000000. The result will be an approximate figure in bytes.
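Applying that to just the pre-2010 rows, a sketch (assuming the statistics gathered above are current) would be:

-- Approximate space consumed by the 12 million pre-2010 rows
select 12000000 * avg_row_len as approx_bytes_pre_2010
from dba_tables
where table_name = 'ORDERS';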
We've got a simple query on a table with 7000 records. The query hits around 1,400 records.
When I press Run Statement it takes less than a second to execute.
When I press Run Script (F5) it takes 22 seconds. Is this normal or is this a network problem?
When you use Run Statement, the data grid only fetches the first 50 rows (by default). As you page down it does further batch fetches, which take additional time. It will repeat that another 27 times or so to get all 1,400 rows, but only on demand.
When you Run Script all of the rows (up to whatever limit you have set; I think the default is still 5000) are fetched in one go - so it is spending more time getting the data across the network straight away.
If you end up looking at all of the rows in the data grid then the total time spent transferring the data from the DB server to your PC is the same; it's just split into chunks, and there may be more overhead overall from the extra round trips (depending on how many rows are fetched over JDBC in each batch). If you page down right to the bottom of the grid, watch the 'fetched rows' display at the top of the grid; when it reaches the end, compare the 'All rows fetched in ...' time with the script run.