I have 2 TB of cell phone records, about 33 billion readings from 1.8 million users.
I have created a partition on the user id.
Impala creates many sub-directories called userid=XXXXX.
This seems like over-partitioning with 1.8 million sub-dirs. Is there a way to have partitions covering a range or array of numbers?
Currently Impala does not have any sort of range partitioning, so you will need to partition on a different column in your table that would create fewer partitions. Alternatively, as a workaround, you could add an additional column to your table that stores the range each record falls into, and then partition on that "range" column. Example: a record with the field user_id=1234 would then also have a range field user_range=0_100000, which you could use for partitioning.
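A minimal sketch of that workaround in Impala SQL, assuming made-up table and column names and a bucket size of 100,000 user ids, might look like this:

-- Hypothetical partitioned copy of the readings table; with 1.8M users and
-- 100,000-id buckets this yields roughly 18 partitions instead of 1.8 million.
CREATE TABLE readings_by_range (
  user_id    BIGINT,
  reading_ts TIMESTAMP,
  cell_id    STRING
)
PARTITIONED BY (user_range STRING)
STORED AS PARQUET;

-- Dynamic-partition insert: derive the range label (e.g. '0_100000') from
-- user_id; the partition column goes last in the SELECT list.
INSERT INTO readings_by_range PARTITION (user_range)
SELECT
  user_id,
  reading_ts,
  cell_id,
  CONCAT(CAST(CAST(FLOOR(user_id / 100000) AS BIGINT) * 100000 AS STRING),
         '_',
         CAST((CAST(FLOOR(user_id / 100000) AS BIGINT) + 1) * 100000 AS STRING)) AS user_range
FROM readings;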
I have a Cosmos DB collection which will store 8 million records monthly, which comes to about 5 GB of data per month.
I want to use a date-based partition key.
So the question is, should I keep the partition key as Year_Month, or divide it further into Year_Month_Day?
How many logical partitions are supported by Cosmos DB? Is there any limit to it?
There is no limit to the number of logical partitions in Cosmos DB. It will keep scaling and splitting the underlying physical partitions to support as many as you need.
The only limitation is that each logical partition can hold up to 10 GB of data. Once that amount is reached you cannot add more data to that logical partition, and you have to migrate to a collection with a different partition key.
So with that in mind, the decision comes down to this:
Will you ever have 10 GB worth of documents with the same Year_Month value? If not, then that should be your partition key. If yes, then you should make the key more granular and add the day. Again, will you ever have 10 GB worth of documents with the same Year_Month_Day value? If yes, then you need a different key definition.
I'm confused about what to choose for the PartitionKey and what effect it has. If I use a Partitioned Collection then I must define a Partition Key that can be used by DocumentDB to distribute the data among multiple servers. But let's say that I choose a partitionKey that is always the same for all documents. Will I still be able to get up to 250k RU/s for a single Partitioned Collection?
In my case the main query is to get all documents with paging, ordered by a timeline (newest first):
SELECT TOP 10 c.id, c.someValue, u.id FROM c
JOIN u IN c.users ORDER BY c.createdDate DESC
A minified version of the document looks like this:
{
  id: "1",
  someValue: "Foo",
  createdDate: "2016-04-14T14:38:00.00",
  // Max 100 users
  users: [{ id: "1" }, { id: "2" }]
}
No, you need to have multiple distinct partition key values in order to achieve high throughput levels in DocumentDB.
A partition in DocumentDB supports up to 10,000 RU/s, so you need at least 25* distinct partition key values to reach 250,000 RU/s. DocumentDB distributes the partition keys evenly across the available partitions, i.e. a partition might contain documents with multiple partition keys, but the data for a given partition key is guaranteed to stay within a single partition. You must also structure your workload in a manner that distributes reads/writes across these partition keys.
*You may need a slightly higher number of partition keys than 25 (50-100) in practice since some of the partition keys might hash to the same partition
So, we have a partitioned (10 partitions) collection with a throughput of 10,000 RU/s. The partition key is CountryCode and we only have data for 5 countries. Data for two countries was hashed into the same physical partition. As per the documentation found at the following link, we were expecting data to be reorganized onto the empty partitions once the 10 GB limit was hit for that partition. That didn't happen, and we could no longer add data for those two countries.
Obviously, the right thing to do would be to choose a partition key with higher cardinality, so that no single key value outgrows a partition, but the documentation is misleading.
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
When a physical partition p reaches its storage limit, Cosmos DB seamlessly splits p into two new partitions p1 and p2 and distributes values corresponding to roughly half the keys to each of the partitions. This split operation is invisible to your application.
I would like to get a good understanding of what would be the price (in terms of $) of using DynamoDB Titan backend. For this, I need to be able to understand when DynamoDB Titan backend does reads and writes. Right now I am pretty clueless.
Ideally I would like to run a test case which adds some vertices and edges, then does a rather simple traversal, and then see how many reads and writes were done. Any ideas on how I can achieve this? Possibly through metrics?
If it turns out I can't extract this information myself, I would very much appreciate a first brief explanation about when DynamoDB Titan backend performs reads and writes.
For all Titan backends, to understand and estimate the number of writes, we rely on estimating the number of columns for a given KCVStore. You can also measure the number of columns that get written using metrics when using the DynamoDB Storage Backend for Titan.
To enable metrics, enable the configuration options listed here.
Specifically, enable lines 7-11.
Note the max-queue-length configuration property. If the executor-queue-size metric hits max-queue-length for a particular tx.commit() call, then you know that the queue / storage.buffer-size were not large enough. Once the executor-queue-size metric peaks without reaching max-queue-length, you know you have captured all the columns being written in a tx.commit() call, so that will give you the number of columns being changed in a tx.commit(). You can look at UpdateItem metrics for edgestore and graphindex to understand the spread of columns between the two tables.
All Titan storage backends implement KCVStore, and the keys and columns have different meanings depending on the kind of store. There are two stores that get the bulk of writes, assuming you have not turned on user-defined transaction logs. They are edgestore and graphindex.
The edgestore KCVStore is always written to, regardless of whether you configure composite indexes. Each edge and all of its edge properties are represented by two columns (unless you set the schema of that edge label to be unidirectional). The key of an edge column is the out-vertex of the edge in the direct direction and the in-vertex in the reverse direction; correspondingly, the column of an edge is the in-vertex in the direct direction and the out-vertex in the reverse direction. Each vertex is represented by at least one column for the VertexExists hidden property, one column for a vertex label (optional), and one column for each vertex property. The key of a vertex is the vertex id, and the columns correspond to vertex properties, hidden vertex properties, and labels.
The graphindex KCVStore will only be written to if you configure composite indexes in the Titan management system. You can index vertex and edge properties. For each pair of indexed value and edge/vertex that has that indexed value, there will be one column in the graphindex KCVStore. The key will be a combination of the index id and value, and the column will be the vertex/edge id.
Now that you know how to count columns, you can use this knowledge to estimate the size and number of writes to edgestore and graphindex when using the DynamoDB Storage Backend for Titan. If you use the multiple-item data model for a KCVStore, you will get one item for each key-column pair. If you use the single-item data model for a KCVStore, you will get one item for all columns at a key (this is not necessarily true when graph partitioning is enabled, but that is a detail I will not discuss now). As long as each vertex property is less than 1 KB, and the sum of all edge properties for an edge is less than 1 KB, each column will cost 1 WCU to write when using the multiple-item data model for edgestore. Likewise, each column in the graphindex will cost 1 WCU to write if you use the multiple-item data model.
Let's assume you did your estimation and you use the multiple-item data model throughout. Let's assume you estimate that you will be writing 750 columns per second to edgestore and 750 columns per second to graphindex, and that you want to drive this load for a day. You can set the read capacity for both tables to 1, so you know each table will start off with one physical DynamoDB partition. In us-east-1, the cost for writes is $0.0065 per hour for every 10 units of write capacity, so 24 * 75 * $0.0065 is $11.70 per day for writes for each table. This means the write capacity would cost $23.40 per day for edgestore and graphindex together. The reads could be set to 1 read per second for each of the tables, making the read cost 2 * 24 * $0.0065 = $0.312 for both tables per day. If your AWS account is new, the reads would fall within the free tier, so effectively, you would only be paying for the writes.
Another aspect of DynamoDB pricing is storage. If you write 750 columns per second, that is 64.8 million items per day to one table, which means roughly 1.9 billion (approximately 2 billion) items per month. The average number of items in the table over the first month is then 1 billion. If each item averages out to 412 bytes, and there are 100 bytes of overhead, then that means 1 billion 512-byte items are stored for a month, approximately 477 GB. 477 / 25 rounded up is 20, so storage for the first month at this load would cost 20 * $0.25 per month. If you keep adding items at this rate without deleting them, the monthly storage cost will increase by approximately 5 dollars per month.
If you do not have super nodes in your graph, or vertices with a relatively large number of properties, then the writes to the edgestore will be distributed evenly throughout the partition key space. That means your table will split into 2 partitions when it hits 10 GB, then each of those partitions will split into a total of 4 partitions when they hit 10 GB, and so on and so forth. The smallest power of 2 that covers 477 GB / (10 GB per partition) is 2^6 = 64, so your edgestore would split 6 times over the course of the first month and you would probably have around 64 partitions at the end of it. Eventually, your table will have so many partitions that each partition will have very few IOPS. This phenomenon is called IOPS starvation.

You should have a strategy in place to address IOPS starvation. Two commonly used strategies are 1. batch cleanup/archival of old data and 2. rolling (time-series) graphs. In option 1, you spin up an EC2 instance to traverse the graph, write old data to a colder store (S3, Glacier, etc.), and delete it from DynamoDB. In option 2, you direct writes to graphs that correspond to a time period (weeks - 2015W1, months - 2015M1, etc.). As time passes, you down-provision the writes on the older tables, and when the time comes to migrate them to colder storage, you read the entire graph for that time period and delete the corresponding DynamoDB tables. The advantage of this approach is that it allows you to manage your write provisioning cost with higher granularity, and it allows you to avoid the cost of deleting individual items (because you delete a table for free instead of incurring at least 1 WCU for every item you delete).
I output the query plan on SQLite, and it shows
0|0|0|SCAN TABLE t (~500000 rows)
I wonder what the number (500000) means. I guess it is the table size, but I executed the query on a small table which does not have nearly that many rows.
Is there any official documentation about the meaning of this number? Thanks.
As the official documentation says, this is the number of rows that the database estimates will be returned.
If there is an index on a searched column, and if you have run ANALYZE, then SQLite can make an estimate based on the actual data. Otherwise, it assumes that tables contain one million rows, and that a search like column > x filters out half the rows.
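As a small sketch (the table t and column x here are made up), you can compare the plan before and after creating an index and running ANALYZE; on versions that print row estimates, the number changes from the default assumption to one based on the gathered statistics:

EXPLAIN QUERY PLAN SELECT * FROM t WHERE x > 10;   -- no index, no statistics: default assumption

CREATE INDEX idx_t_x ON t(x);                      -- index on the searched column
ANALYZE;                                           -- gathers statistics into sqlite_stat1
EXPLAIN QUERY PLAN SELECT * FROM t WHERE x > 10;   -- estimate now based on the actual data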
I have a table which has the primary key on one column and is partitioned by a date column. This is a sample of the DDL:
CREATE MULTISET TABLE DB.TABLE_NAME,
NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
( FIRST_KEY DECIMAL(20,0) NOT NULL,
SECOND_KEY DECIMAL(20,0) ,
THIRD_COLUMN VARCHAR(5),
DAY_DT DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX TABLE_NAME_IDX_PR (FIRST_KEY)
PARTITION BY RANGE_N(DAY_DT BETWEEN DATE '2007-01-06'
AND DATE '2016-01-02' EACH INTERVAL '1' DAY );
COLLECT STATS ON DB.TABLE_NAME COLUMN(FIRST_KEY);
The incoming data can be around 30 million rows each day, and I have loaded the data for 2012-04-11. Now I have to collect stats for only the '2012-04-11' partition instead of the whole table.
Is there any way to collect stats for a particular day's partition?
You can simply collect stats on the system column PARTITION and it should update the histograms relating to the partitioned column.
COLLECT STATS ON {databasename}.{tablename} COLUMN (PARTITION);
This can be collected on both partitioned and non-partitioned tables. It helps provide the optimizer with the cardinality of the table and of the partitions (if they exist), and it will update the statistics for all the partitions of the table. Collecting stats on the PARTITION column is a low-CPU-cost, short wall-clock process; it is significantly less expensive than collecting stats on a physical column or the entire table, even for tables with millions, tens of millions, or more records.
If you want to determine whether the optimizer recognizes the refreshed statistics, there is no direct way as of TD 13.10 (not sure about TD 14.x). However, if you run an EXPLAIN on your query you can tell whether the optimizer has high confidence in the step in which the criteria against the partitioning column are included. If you specify a single date, such as DATE '2012-04-11', you should see in the EXPLAIN that partition elimination has taken place on a single partition.
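For example, a small sketch based on the DDL above (the query itself is only illustrative):

COLLECT STATS ON DB.TABLE_NAME COLUMN(PARTITION);   /* refresh after the daily load */

EXPLAIN
SELECT COUNT(*)
FROM DB.TABLE_NAME
WHERE DAY_DT = DATE '2012-04-11';   /* plan should show elimination to a single partition */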
If you need help with digesting the EXPLAIN, edit your original question with the EXPLAIN plan for the query and I will help you digest it.