We are planning to archive older data from some tables, but before doing so we need to estimate how much space we will gain once we purge the older records.
For example, suppose we have an ORDERS table that consumes 5 GB of disk space and holds more than 15 million records. We are only interested in keeping records from 2010 onwards. When we query for records before 2010, we get approximately 12 million records, and we are planning to archive and purge them.
We first have to calculate how much free space we will gain when we remove these 12 million records. How can we calculate the space consumed by such a selected set of records?
One way I can think of is to create a new table holding these 12 million old records and then check its segment size.
Please suggest whether there is a better way to calculate the space used by the selected records. Thanks.
To estimate the space used by the selected records, you can try the following:
Step 1:
scott#dev8i> analyze table orders compute statistics;
Table analyzed.
Or, as per Ben's suggestion:
scott#dev8i> EXEC DBMS_STATS.gather_table_stats('<SCHEMA>', 'ORDERS');
Step 2:
scott#dev8i> select num_rows * avg_row_len
from dba_tables
where table_name = 'ORDERS';
NUM_ROWS*AVG_ROW_LEN
--------------------
560 --- this is the total table size
The result of the query shows that this ORDERS table is using roughly 560 bytes of the space allocated to it (the scott demo table here is tiny, of course; your 5 GB table will return a much larger figure).
Since you want to know how much space the 12 million old records use, just replace num_rows with 12000000. The result will be an approximate figure in bytes.
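If it helps to see the arithmetic, here is a quick Python sketch with a made-up avg_row_len of 300 bytes (substitute whatever the query actually returns for your ORDERS table):

    # Rough estimate of the space used by the 12 million rows to be purged.
    avg_row_len = 300            # hypothetical value in bytes, taken from the query above
    rows_to_purge = 12_000_000
    estimated_bytes = rows_to_purge * avg_row_len
    print(f"~{estimated_bytes / 1024 ** 3:.2f} GB used by the old rows")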
I have an AS tabular model that contains a fact table with 20 million rows. I have partitioned it so that only the new rows get added each day... however, occasionally a historical row (from years ago) will be modified. I can identify the modified row in SQL (using the last-modified timestamp), but would it be possible for me to refresh that row in SSAS to reflect the change without having to refresh my entire data model? How would I achieve this?
First, 20 million rows is not a lot. I'd expect that to take only 5-10 minutes to process unless your SQL queries are very inefficient or very wide. So why bother to optimize something which may be fast enough already?
If you do need to optimize it, you will first want to partition the large fact table by some date element. Since you only have 20 million rows I would suggest partitioning by year. Optimal compression will be achieved with around 8 million rows per partition. Over-partitioning (such as creating thousands of daily partitions) is counter-productive.
When a new row is added you could perform a ProcessAdd to insert just the new records to the partitions in question. However I would recommend just doing a ProcessFull on any year partitions which have any inserts, updates or deletes in SQL.
SSAS doesn’t support updating a specific row. Thus you have to follow the ProcessFull advice above.
There are several code examples including this one which may help you.
Again this may be overkill if you only have 20 million rows.
I need to create a table with the following fields:
place, date, status
My keys are: partition key - place, sort key - date
Status can be either 0 or 1
The table has approximately 300k rows per day and about 3 days' worth of data at any given time, so about 1 million rows. I have a service that is continuously populating data into this DDB table.
I need to run the following queries (only) once per day:
#1 Return count of all places with date = current_date-1
#2 Return count and list of all places with date= current_date-1 and status = 0
Questions:
As date is already a sort key, is query #1 bound to be quick?
Do we need to create indexes on sort key fields?
If the answer to the above question is yes: for query #2, do I need to create a GSI on date and status, with date as the partition key and status as the sort key?
Creating a GSI vs. using a filter expression on status for query #2: which of the two is recommended?
Running analytical queries (such as counts) is the wrong usage of a NoSQL database like DynamoDB, which is designed for scalable lookup use cases.
Even if you get the Scan to work with one design or another, it will be more expensive and slower than it should be.
A better option is to export the table data from DynamoDB into S3 and then run an Athena query over that data. That is much more flexible for running various analytical queries.
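For reference, a minimal sketch of kicking off such an export with boto3 (the table ARN and bucket are placeholders, and this native export API requires point-in-time recovery to be enabled on the table); the exported data can then be queried with Athena:

    import boto3

    # Start a DynamoDB-managed export of the table to S3.
    dynamodb = boto3.client("dynamodb")
    dynamodb.export_table_to_point_in_time(
        TableArn="arn:aws:dynamodb:us-east-1:111122223333:table/places_status",  # placeholder ARN
        S3Bucket="my-ddb-exports",                                               # placeholder bucket
        S3Prefix="places_status/",
        ExportFormat="DYNAMODB_JSON",
    )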
The easiest thing for you to do is a full table scan once per day, filtering by yesterday's date, and as part of that keep your own client-side count of whether the status was 0 or 1. The filter is not index optimized, so it will be a true full table scan.
Why not an export to S3? Because you're really just doing one query. If you follow the export route you'll have to do a new export every day to keep the data fresh, and the cost of the export in dollar terms (plus complexity) is more than a single full scan. If you were going to do repeated queries against the data then the export would make more sense.
Why not use GSIs? They would make the table scan more efficient by minimizing what's scanned. However, there's a cost (plus complexity) in keeping them current.
Short answer: a once per day full table scan is both simple to implement and as fast as you want (parallel scan is an option), plus it's not really costly.
How much would it cost? Million rows, 100 bytes each, so that's a 100 MB table. That's 25,000 read units to fully scan, which is halved down to 12,500 with eventual consistency. On Demand pricing is $0.25 per million read units. 12,500 / 1,000,000 * $0.25 = $0.003. Less than a dime a month. It'd be cheaper still if you run provisioned.
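A quick back-of-the-envelope check of those numbers in Python (4 KB per read unit, halved for eventually consistent reads, and the quoted on-demand price of $0.25 per million read units):

    # ~1M rows * ~100 bytes = ~100 MB table; a scan consumes 1 read unit per 4 KB,
    # halved for eventually consistent reads.
    table_bytes = 1_000_000 * 100
    read_units = table_bytes / 4096 / 2
    cost_per_scan = read_units / 1_000_000 * 0.25
    print(f"~{read_units:,.0f} read units, ~${cost_per_scan:.4f} per daily scan, ~${cost_per_scan * 30:.2f} per month")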
Just do the scan. :)
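A minimal sketch of that daily scan with boto3, assuming a placeholder table name, dates stored as ISO strings (e.g. "2023-05-01"), and a numeric 0/1 status attribute:

    import boto3
    from boto3.dynamodb.conditions import Attr
    from datetime import datetime, timedelta, timezone

    table = boto3.resource("dynamodb").Table("places_status")   # placeholder table name
    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")

    total = 0          # query #1: count of rows with date = yesterday
    status_zero = []   # query #2: places with date = yesterday and status = 0

    scan_kwargs = {"FilterExpression": Attr("date").eq(yesterday)}  # filtered after the read, so still a full scan
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            total += 1
            if item["status"] == 0:
                status_zero.append(item["place"])
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    print(total, len(status_zero))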
I'm trying to run an initial bulk load into a multiset table, split into 10 INSERT SQLs based on a MOD of the identifier column. The first and second inserts ran, but the third is failing due to high CPU skew.
DBQLOGTBL shows the first SQL took about 10% CPU, the second took 30%, and the third was taking 50% CPU and hence failed.
The number of records being loaded in each split is roughly the same. The step that is failing, as per the EXPLAIN plan, is where Teradata does a MERGE into the main table using spool.
What could be the solution to this problem?
The table is a MULTISET table with a NUPI
It is partitioned on a date column
Post initial load the data volume will be 6 TB, so roughly 600 GB is being inserted in each of the 10 splits
I'm doing a project in ASP.NET Core 2.1 (EF, MVC, SQL Server) and have a table called Orders, which in the end will be a grid (i.e. ledger) of trades and different calculations on those numbers (no paging... so it could run hundreds or thousands of records long).
In that Orders table is a property/column named Size. Size is basically a lot value from 0.01 to maybe 10.0 in increments of 0.01, so 1000 different values to start, and I'm guessing 95% of people will use values less than 5.0.
So originally I thought I would use an OrderSize join table like the one below, with an FK constraint to the Orders table on Size (i.e. SizeId):
SizeId (int)    Value (decimal(9,2))
1               0.01
2               0.02
...etc, etc, etc...
1000            10.00
That OrderSize table will most likely never change (i.e. ~1000 decimal records) and the Size value in the Orders table could get quite repetitive if just dumping decimals in there, hence the reason for the join table.
However, the more I'm learning about SQL, the more I realize I have no clue what I'm doing and the bytes of space I'm saving might create a whole other performance robbing situation or who knows what.
I'm guessing the SizeId int for the join uses 4 bytes, and then another 5 bytes for the actual decimal value? I'm not even sure I'm saving much space.
I realize both methods will probably work OK, especially on smaller queries. However, what is technically the correct way to do this? And are there any other gotchas or no-nos I should be considering when eventually calculating my grid values, like you would in an account ledger (i.e. assuming the join is the way to go)? Thank you!
It really depends on what your main objective is behind using a lookup table. If it's only about your concerns around storage space, then there are other ways you can design your database (using partitions and archiving bigger tables on cheaper storage).
That would be more scalable than the lookup table approach (what happens if there is more than one decimal field in the Orders table - do you create a lookup table for each decimal field?).
You will also have to consider indexes on the Orders table when joining to the OrderSize table if you decide to go down that route. It can potentially lead to more frequent index scans if the join key is not part of the index on the Orders table, thereby causing slower query performance.
I would like to get a good understanding of what the price (in dollar terms) of using the DynamoDB Titan backend would be. For this, I need to be able to understand when the DynamoDB Titan backend does reads and writes. Right now I am pretty clueless.
Ideally I would like to run a test case which adds some vertices and edges, then does a rather simple traversal, and then see how many reads and writes were done. Any ideas on how I can achieve this? Possibly through metrics?
If it turns out I can't extract this information myself, I would very much appreciate a first brief explanation about when DynamoDB Titan backend performs reads and writes.
For all Titan backends, to understand and estimate the number of writes, we rely on estimating the number of columns for a given KCVStore. You can also measure the number of columns that get written using metrics when using the DynamoDB Storage Backend for Titan.
To enable metrics, enable the configuration options listed here.
Specifically, enable lines 7-11.
Note the max-queue-length configuration property. If the executor-queue-size metric hits max-queue-length for a particular tx.commit() call, then you know that the queue / storage.buffer-size were not large enough. Once the executor-queue-size metric peaks without reaching max-queue-length, you know you have captured all the columns being written in a tx.commit() call, so that will give you the number of columns being changed in a tx.commit(). You can look at UpdateItem metrics for edgestore and graphindex to understand the spread of columns between the two tables.
All Titan storage backends implement KCVStore, and the keys and columns have different meanings depending on the kind of store. There are two stores that get the bulk of writes, assuming you have not turned on user-defined transaction logs. They are edgestore and graphindex.
The edgestore KCVStore is always written to, regardless of whether you configure composite indexes. Each edge and all of the edge properties of that edge are represented by two columns (unless you set the schema of that edge label to be unidirectional). The key of an edge column is the out-vertex of the edge in the direct direction and the in-vertex of the edge in the reverse direction; correspondingly, the column of an edge is the in-vertex in the direct direction and the out-vertex in the reverse direction. Each vertex is represented by at least one column for the VertexExists hidden property, one column for a vertex label (optional), and one column for each vertex property. The key of a vertex is the vertex id, and the columns correspond to vertex properties, hidden vertex properties, and labels.
The graphindex KCVStore will only be written to if you configure composite indexes in the Titan management system. You can index vertex and edge properties. For each pair of indexed value and edge/vertex that has that indexed value, there will be one column in the graphindex KCVStore. The key will be a combination of the index id and value, and the column will be the vertex/edge id.
Now that you know how to count columns, you can use this knowledge to estimate the size and number of writes to edgestore and graphindex when using the DynamoDB Storage Backend for Titan. If you use the multiple-item data model for a KCVStore, you will get one item for each key-column pair. If you use the single-item data model for a KCVStore, you will get one item for all the columns at a key (this is not necessarily true when graph partitioning is enabled, but that is a detail I will not discuss now). As long as each vertex property is less than 1 KB, and the sum of all edge properties for an edge is less than 1 KB, each column will cost 1 WCU to write when using the multiple-item data model for edgestore. Again, each column in graphindex will cost 1 WCU to write if you use the multiple-item data model.
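To make the column counting concrete, here is a rough estimator sketch in Python following the rules above (multiple-item data model, every column under 1 KB, so one WCU per column written; the graph sizes and helper names are made up for illustration):

    def edgestore_columns(num_vertices, props_per_vertex, num_edges, labeled=True):
        # each vertex: VertexExists column + optional label column + one column per property
        per_vertex = 1 + (1 if labeled else 0) + props_per_vertex
        # each (bidirectional) edge is stored as two columns, one per direction
        return num_vertices * per_vertex + 2 * num_edges

    def graphindex_columns(indexed_pairs):
        # one column per (indexed value, vertex/edge) pair across the composite indexes
        return indexed_pairs

    edge_cols = edgestore_columns(num_vertices=1_000_000, props_per_vertex=3, num_edges=5_000_000)
    index_cols = graphindex_columns(indexed_pairs=1_000_000)
    print(f"~{edge_cols:,} edgestore columns and ~{index_cols:,} graphindex columns "
          f"=> roughly that many WCUs to write them")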
Let's assume you did your estimation and use the multiple-item data model throughout. Let's assume you estimate that you will be writing 750 columns per second to edgestore and 750 columns per second to graphindex, and that you want to drive this load for a day. You can set the read capacity for both tables to 1, so you know each table will start off with one physical DynamoDB partition to begin with. In us-east-1, the cost for writes is $0.0065 per hour for every 10 units of write capacity, so 24 * 75 * $0.0065 is $11.70 per day for writes for each table. This means the write capacity would cost $23.40 per day for edgestore and graphindex together. The reads could be set to 1 read per second for each of the tables, making the read cost 2 * 24 * $0.0065 = $0.312 for both tables per day. If your AWS account is new, the reads would fall within the free tier, so effectively you would only be paying for the writes.
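The same throughput arithmetic as a quick Python check, using the us-east-1 prices quoted above:

    # $0.0065 per hour for every 10 units of write capacity; 750 columns/sec written per table.
    write_cost_per_table_per_day = 24 * (750 / 10) * 0.0065
    read_cost_per_day = 2 * 24 * 0.0065          # 1 read unit on each of the two tables
    print(f"writes: ${write_cost_per_table_per_day:.2f}/day per table, "
          f"${2 * write_cost_per_table_per_day:.2f}/day for both; reads: ${read_cost_per_day:.3f}/day")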
Another aspect of DynamoDB pricing is storage. If you write 750 columns per second, that is 64.8 million items per day to one table, which means 1.9 billion (approximately 2 billion) items per month. The average number of items in the table over a month is then 1 billion. If each item averages out to 412 bytes, and there are 100 bytes of overhead, then 1 billion 512-byte items are stored for a month, approximately 477 GB in a month. 477 / 25 rounded up is 20, so storage for the first month at this load would cost 20 * $0.25 dollars a month. If you keep adding items at this rate without deleting them, the monthly storage cost will increase by approximately 5 dollars per month.
If you do not have super nodes in your graph, or vertices with a relatively large number of properties, then the writes to edgestore will be distributed evenly throughout the partition key space. That means your table will split into 2 partitions when it hits 10 GB, then each of those will split for a total of 4 partitions when they hit 10 GB, and so on. The next power of 2 above 477 GB / (10 GB per partition) ≈ 48 is 2^6 = 64, so your edgestore would split 6 times over the course of the first month, and you would probably have around 64 partitions at the end of it. Eventually, your table will have so many partitions that each partition will have very few IOPS. This phenomenon is called IOPS starvation.

You should have a strategy in place to address IOPS starvation. Two commonly used strategies are 1. batch cleanup/archival of old data and 2. rolling (time-series) graphs. In option 1, you spin up an EC2 instance to traverse the graph, write old data to a colder store (S3, Glacier, etc.) and delete it from DynamoDB. In option 2, you direct writes to graphs that correspond to a time period (weeks - 2015W1, months - 2015M1, etc.). As time passes, you down-provision the writes on the older tables, and when the time comes to migrate them to colder storage, you read the entire graph for that time period and delete the corresponding DynamoDB tables. The advantage of this approach is that it lets you manage your write provisioning cost with higher granularity, and it avoids the cost of deleting individual items (you drop a table for free instead of incurring at least 1 WCU for every item you delete).