How to read the specified partition of a COMPO partition table? - database-partitioning

I created a partitioned database with a COMPO domain. Its schema is shown below:
partitionSchema->([2021.01.31,2021.01.30,2021.01.29,2021.01.28,2021.01.27,2021.01.26,2021.01.25,2021.01.24,2021.01.23,2021.01.22,...],10)
databaseDir->dfs://StockTick
engineType->OLAP
partitionSites->
partitionTypeName->[VALUE,HASH]
partitionType->[1,5]
The first level of partitions is VALUE domain, and the second level of partitions is HASH domain.
How to read the data in each partition?

Suppose you want to read the data of partition /level2/20210813/Key0. You may use this piece of script:
select * from loadTable("dfs://StockTick","bond_tick") where trade_date=2021.08.13,partition(secu_code, 0)
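To read every second-level partition of one date in turn, the same query can be generated once per hash bucket. Below is a minimal sketch in Python, assuming the 10 hash buckets from the schema above are keyed 0 through 9 and that the table and column names match the script above. `partition_queries` is a hypothetical helper: it only builds the script strings and does not contact the server.

```python
def partition_queries(db_path, table, date_literal, hash_col, buckets):
    """Build one DolphinDB query string per hash bucket of a given date."""
    template = (
        'select * from loadTable("{db}","{tb}") '
        "where trade_date={d},partition({col}, {key})"
    )
    return [
        template.format(db=db_path, tb=table, d=date_literal, col=hash_col, key=k)
        for k in range(buckets)  # assumption: bucket keys run from 0 to buckets - 1
    ]


queries = partition_queries("dfs://StockTick", "bond_tick", "2021.08.13", "secu_code", 10)
print(queries[0])
```

Each string can then be submitted to the server by whatever client you already use.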

Related

Can I create multiple DynamoDB tables with secondary indexes concurrently?

I am confused by the API documentation of CreateTable from DynamoDB. I need to create multiple tables with a secondary index. From the API: https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/dynamodb/DynamoDbClient.html#createTable-software.amazon.awssdk.services.dynamodb.model.CreateTableRequest-
If you want to create multiple tables with secondary indexes on them, you must create the tables sequentially. Only one table with secondary indexes can be in the CREATING state at any given time.
and
Up to 500 simultaneous table operations are allowed per account. These operations include CreateTable, UpdateTable, DeleteTable, UpdateTimeToLive, RestoreTableFromBackup, and RestoreTableToPointInTime.
The only exception is when you are creating a table with one or more secondary indexes. You can have up to 250 such requests running at a time;
So can I create only one table with a secondary index at a time, or 250 at the same time?
If I create multiple tables sequentially without waiting for the active state, does that already count as concurrent creation?
Must I wait for the active state of every table if I create multiple tables with secondary indexes?
An individual account can only be running one "Create Index" action at a time, no matter how many tables you have.
To understand this, it may help to understand what an index is. An index is a complete copy of the table, but with a different partition key and sort key. So if your original table has a PK of userId and an SK of sort_key, you could create an index where the partition key is sort_key and the sort key is userId, creating an inverted index. This is a common practice in Dynamo: queries must know the PK, so with a userId you can access all data of a given user. If you wanted all users who have a particular tag, you might store an SK item on users such as TAG#ThisTag; a query against the inverted index with pk = TAG#ThisTag then returns a list of userIds.
While the CreateIndex is running on a given table, no other actions can be run on it - it won't accept changes to the data or configuration that would cause a fault or mismatch in the copying process. This is one of the reasons a given account is limited to only one create index operation at a time.
As a slight aside, if I may: if you have a single account with multiple Dynamo tables all for the same product, you may want to rethink your database strategy. A single Dynamo table can hold many different kinds of data if you set up your PK and SK as generic fields (i.e. pk and sk as the attribute names). No document inside your table has to have the same attributes as any other. And when accessing data, each partition key is exactly what its name says: a partition of data that is all that is accessed when a query is made against that PK. So if you have 100 items with a PK of USER#1 and 100 items with a PK of USER#2 and you query against USER#1, you only access those 100 items. The rest are ignored by the query and never touched, which in effect lets you have multiple "tables" in a single DynamoDB table by giving them different partition key prefixes.
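Creating the tables strictly one after another, waiting for each to become active, can be sketched as follows. This is a minimal sketch, assuming `client` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`) and that each entry of `table_specs` is a complete set of CreateTable arguments. `create_tables_sequentially` is a hypothetical helper name.

```python
def create_tables_sequentially(client, table_specs):
    """Create tables one at a time, waiting for each before starting the next."""
    created = []
    for spec in table_specs:
        client.create_table(**spec)
        # The "table_exists" waiter polls DescribeTable until the table is ACTIVE.
        client.get_waiter("table_exists").wait(TableName=spec["TableName"])
        created.append(spec["TableName"])
    return created
```

Waiting between calls keeps at most one table with secondary indexes in the CREATING state, which is the stricter of the two limits quoted in the question.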

Are Azure CosmosDB indexes split by partition

I am sending some IoT events into Azure Cosmos DB. I am partitioning by device id and I am always querying by device id. I want to know if the automatically created indexes are separated by partition key. Specifically if I do query like
SELECT TOP 5 ... FROM events WHERE deviceId = X ORDER BY timeStamp DESC
Will it use the automatically created index on timeStamp and if so is it effective. Basically what I am asking is if there are separate indexes on timeStamp for each partition key (deviceId in my case) because otherwise the index will be relatively useless because the range will contain a lot of irrelevant data from other devices. If this was SQL Server I would create an index on deviceId followed by timeStamp but I am not sure how Cosmos DB works by default.
Indexes sit within the partition, so yes.
For the query you have, you should also create a composite index with DESC sort order for the best performance.

Choosing Primary key for DynamoDB

A bit of context: I am trying to build an inventory to list my AWS resources in various accounts and I am planning to use DynamoDB to store the data. These will be the columns for my table: ResourceARN, ResourceName, ResourceType, StandardTag, IsDeleted, LastUpdateTime and ResourceCreationDate (this field is available only for a few resource types like EC2).
Question: I want to query my DDB table using account ID, resource type and tag name. I am stumped on choosing the primary key for the table. Since the primary key should be unique and has to support a 1:many relationship, I cannot use a combination of resourceType and account ID. Nor can I use resourceArn as my primary key, since it is a 1:1 relationship. Also, using the resourceARN as the sort key does not make sense to me. I understand that I can use a simple scan operation, but that is very costly and will take time as I add more data to my DDB.
I would appreciate any suggestions or guidance over the same.
Short answer
Partition key: Account ID
Sort key: <resource type>/<resource ID>
Rationale
It's a common pattern for a sort key to be a string concatenating multiple attributes. Since sort keys can be queried by prefix, you can leverage this in your queries:
Get all account resources: query all sort keys on the Account ID partition key
Get all EC2 instances of an account: query with partition key = <your account ID> and sort key begins_with('ec2-instance').
You may notice that ARNs follow such a hierarchy as well (which is probably not a coincidence). This would effectively be using a subset of the ARN as the sort key.
Some notes:
DynamoDB is about attributes as much as about columns. You don't need to include ResourceCreationDate in the records which don't have it, and doing so will save you space (see next point).
Attribute names count as storage for every record, which impacts cost and also throughput. It's common to use shorthand for names for this reason (rct instead of ResourceCreationTime for example).
You can use LSIs (Local Secondary Indexes) to order by creation and update times if you need this.
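The two queries described above can be sketched as arguments for the low-level Query operation. This is a minimal sketch: the attribute names `pk` and `sk`, the table name and the helper `resources_query` are assumptions for illustration, and the returned dictionary is meant to be passed to a boto3 client as `client.query(**params)`.

```python
def resources_query(table_name, account_id, resource_type=None):
    """Build Query arguments for an account's resources, optionally by type."""
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "pk = :pk",
        "ExpressionAttributeValues": {":pk": {"S": account_id}},
    }
    if resource_type is not None:
        # Sort keys look like "<resource type>/<resource ID>", so match the prefix.
        params["KeyConditionExpression"] += " AND begins_with(sk, :prefix)"
        params["ExpressionAttributeValues"][":prefix"] = {"S": resource_type + "/"}
    return params


all_resources = resources_query("inventory", "123456789012")
ec2_only = resources_query("inventory", "123456789012", "ec2-instance")
```

Including the trailing "/" in the prefix avoids matching another resource type whose name merely starts with the same characters.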

AWS DynamoDB Query based on non-primary keys

I'm new to AWS DynamoDB and wanted to clarify something. Is it possible to query a table and filter based on a non-primary key attribute? My table looks like the following:
Store
Id: PrimaryKey
Name: simple string
Location: simple string
Now I want to query on the Name, but I think I have to give the key as well from what I know? Apart from that I can use the scan but then I will be loading all the data.
From the docs:
The Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
DynamoDB requires queries to always use the partition key.
In your case your options are:
create a Global Secondary Index that uses Name as a primary key
use a Scan + Filter if the table is relatively small, or if you expect the result set will include the majority of the records in the table
There are a few design principles that you can follow when using DynamoDB. If you are coming from a relational background, you have already run into the limitation that queries must use primary key attributes.
Design your tables, for querying and separating hot and cold data.
Create Indexes for Querying from Non Key attributes (You have two options, Global Secondary Index which you can define at any time and Local Secondary Index which you need to specify at table creation time).
With the Global Secondary Index you can promote any NonKey attribute as the Partition Key for the Index and select another attribute for Sort Key for querying. For Local Secondary Index, you can promote any Non Key attribute as the Sort Key keeping the same Partition Key.
Using Indexes for query is important also to improve the efficiency in using provisioned throughput.
Although indexes consume throughput of their own, they can also save read throughput: if you project only the attributes you need to read, the benefit can be huge. Check the following example.
Let's say you have a DynamoDB table that has items of 40KB. If you read directly from the table to list 10 items, it consumes 100 Read Throughput Units (10 units per item, since one unit can read 4KB, multiplied by 10 items). If you define an index that projects only the attributes needed for the list, at 4KB per item, it consumes only 10 Read Throughput Units (one unit per item), which makes a huge difference in terms of cost.
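The arithmetic in this example can be checked with a few lines of Python. This is a sketch that follows the per-item calculation of the paragraph above (one unit per 4KB, rounded up per item); `read_units` is a hypothetical helper, not part of any AWS SDK.

```python
import math

READ_UNIT_KB = 4  # one read unit covers up to 4 KB, as in the example above


def read_units(item_size_kb, item_count):
    """Read units needed to read item_count items of item_size_kb each."""
    return math.ceil(item_size_kb / READ_UNIT_KB) * item_count


print(read_units(40, 10))  # 100 units when reading full 40KB items from the table
print(read_units(4, 10))   # 10 units when reading 4KB projections from the index
```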
With DynamoDB it's really important to define indexes that are optimized not only for query capability but also for throughput.
You cannot query based on a non-primary-key attribute in DynamoDB.
If you still want to do that, you can use a scan, but a scan is a costly operation in DynamoDB. On a large table it will hurt performance and is not recommended, because it reads every item in the table and AWS charges you for every item it scans for that query.
There are two ways to achieve it:
Keep the store Id as the primary/partition key of the DynamoDB table and add Name or Location as the sort key (only one, since DynamoDB accepts only one attribute as the sort key by design).
Create Global Secondary Indexes for querying on the non-key attributes that you need most frequently.
There are 3 projection options when creating a GSI in DynamoDB. In your case, select the INCLUDE option and add Name, Location and the store Id to the index.
KEYS_ONLY – Each item in the index consists only of the table partition key and sort key values, plus the index key values. The KEYS_ONLY option results in the smallest possible secondary index.
INCLUDE – In addition to the attributes described in KEYS_ONLY, the secondary index will include other non-key attributes that you specify.
ALL – The secondary index includes all of the attributes from the source table. Because all of the table data is duplicated in the index, an ALL projection results in the largest possible secondary index.
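A GSI on Name with an INCLUDE projection, and a query against it, might look like the sketch below. The index name `NameIndex`, the table name and both helper functions are assumptions for illustration. The dictionaries follow the shape of the `GlobalSecondaryIndexes` entry of CreateTable and of the low-level Query arguments. On a table with provisioned capacity the index definition would also need a `ProvisionedThroughput` entry, and Name must be declared in the table's `AttributeDefinitions`.

```python
def name_index_definition():
    """GSI keyed on Name that also projects Location (table keys come for free)."""
    return {
        "IndexName": "NameIndex",
        "KeySchema": [{"AttributeName": "Name", "KeyType": "HASH"}],
        "Projection": {
            "ProjectionType": "INCLUDE",
            "NonKeyAttributes": ["Location"],
        },
    }


def query_by_name(table_name, store_name):
    """Build Query arguments that look up stores by Name through the index."""
    return {
        "TableName": table_name,
        "IndexName": "NameIndex",
        # "Name" is a DynamoDB reserved word, so refer to it through a placeholder.
        "KeyConditionExpression": "#n = :name",
        "ExpressionAttributeNames": {"#n": "Name"},
        "ExpressionAttributeValues": {":name": {"S": store_name}},
    }


params = query_by_name("Store", "Corner Shop")
```

The table's own key (Id) is always projected into the index, so only Location has to be listed under `NonKeyAttributes`.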

datafile or Tablespace use information

Without creating a trigger, are there any V$ views that show when either a Tablespace or datafile was last accessed or used?
Give you an idea of why... I'm looking to do some reorg and would be nice to know if I can take that particular object or tbs offline.
DBA_HIST_SEG_STAT records the number of reads per tablespace per snapshot. The DBA_HIST_ tables are only periodically refreshed, normally once per hour. To retrieve the latest data, a very similar query using V$SEGMENT_STATISTICS would need to be UNIONed to the query below.
Finding the information per data file is trickier. That information is in DBA_HIST_ACTIVE_SESS_HISTORY, usually in the column P1 when P1TEXT = 'file#'. But that information is only a sample, it's very possible a single read to a data file would not be captured.
Note that using the DBA_HIST_ tables requires the Diagnostics Pack license.
select name, begin_interval_time, end_interval_time, sum(logical_reads_delta)
from dba_hist_seg_stat
join dba_hist_snapshot using (snap_id, dbid, instance_number)
join v$tablespace using (ts#)
group by v$tablespace.name, begin_interval_time, end_interval_time
having sum(logical_reads_delta) > 0
order by v$tablespace.name, begin_interval_time desc
