PARTITION BY & CLUSTERED & DISTRIBUTED BY in U-SQL - need to know their meaning and when to use them - u-sql

I can see that when creating a table in U-SQL we can use the PARTITIONED BY, CLUSTERED, and DISTRIBUTED BY clauses.
As per my understanding, partitioning stores data with the same key (the column we partition on) together or close together (maybe in the same structured stream behind the scenes), so that queries are faster when we use that key in joins and filters.
Clustering, I guess, stores the data of those columns together or close together inside each partition.
And distribution is some method like hash or round robin - the way data is laid out inside each partition. If you have an integer column and you frequently query within some range, use RANGE; otherwise use HASH. If your data is not distributed evenly you may face a data-skew issue, and in that case use ROUND ROBIN.
Question 1: Please let me know whether my understanding above is correct or not.
Question 2: There is an INTO clause - how should we decide the value to use for this INTO clause for the distribution?
Question 3: I would also like to know which of these is vertical partitioning and which is horizontal.
Question 4: I haven't found any good online documentation that explains these concepts with examples. If you know of any, please share the links.

Peter and Bob have given you links to documentation.
To quickly answer your questions here:
Partitions and distributions both partition the data based on the partitioning scheme and both provide data scale out and partition elimination.
Partitions are optional and individually manageable for data life-cycle management (besides giving you the ability to get partition elimination), and currently only support value-based partitioning, where rows with the same partition-column values land in the same partition.
Each partition then gets further partitioned based on the distribution scheme. Here you have different schemes (HASH, RANGE, etc.). The system decides on the number of distribution buckets based on some heuristic. In the case of the HASH distribution scheme, you can also specify the number of buckets with the INTO clause.
The clustering will then specify the order of the rows within a distribution bucket and allows you to further improve query performance (you can do a range scan instead of a full scan, for example).
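To make the three levels concrete, here is a minimal sketch of a U-SQL CREATE TABLE that combines all three clauses. The table name, the columns, and the INTO 16 bucket count are invented for illustration, not taken from your scenario:

    CREATE TABLE IF NOT EXISTS dbo.SalesEvents
    (
        EventDate DateTime,    // partition column: rows with the same date land in the same partition
        CustomerId int,
        Amount double,
        INDEX cIX_SalesEvents CLUSTERED (CustomerId ASC)    // row order inside each distribution bucket
        PARTITIONED BY (EventDate)                          // coarse, individually manageable partitions
        DISTRIBUTED BY HASH (CustomerId) INTO 16            // INTO fixes the number of hash buckets
    );

Regarding Question 2: if you omit INTO, the system picks the bucket count for you based on its heuristic; INTO simply overrides that choice when you want to fix it yourself.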
Vertical and horizontal partitioning are terms sometimes used to separate these two levels of partitioning. I try to avoid it, since it can be confusing to remember which one is which.

Related

DynamoDB: Querying all similar items of a certain type

Keeping in mind the best practices of having a single table and evenly distributing items across partitions using partition keys that are as unique as possible in DynamoDB, I am stuck on one problem.
Say my table stores items such as users, items and devices. I am storing the id for each of these items as the partition key. Each id is prefixed with its type such as user-XXXX, item-XXXX & device-XXXX.
Now the problem is: how can I query only a certain type of object? For example, I want to retrieve all users - how do I do that? It would have been possible if the begins_with operator were allowed for partition keys, so I could search by the prefix, but partition keys only allow the equality operator.
If I instead use the type as the partition key - for example, user as the partition key and the user id as the sort key - it would work, but it would result in only a few partition keys, and thus the hot-key issue. And creating multiple tables is a bad practice.
Any suggestions are welcome.
This is a great question. I'm also interested to hear what others are doing to solve this problem.
If you're storing your data with a Partition Key of <type>-<id>, you're supporting the access pattern "retrieve an item by ID". You've correctly noted that you cannot use begins_with on a Partition Key, leaving you without a clear cut way to get a collection of items of that type.
I think you're on the right track with creating a Partition Key of <type> (e.g. Users, Devices, etc) with a meaningful Sort Key. However, since your items aren't evenly distributed across the table, you're faced with the possibility of a hot partition.
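For concreteness, here is a hedged sketch of that access pattern using DynamoDB's PartiQL syntax (the Query API expresses the same thing); the table name AppData and the generic PK/SK attribute names are invented for illustration:

    SELECT * FROM "AppData" WHERE PK = 'USER' AND begins_with(SK, 'user-')

Because the partition key is fixed with an equality condition, DynamoDB runs this as an efficient Query rather than a Scan - but note that every "user" read now lands on a single partition key, which is exactly the hot-partition concern discussed next.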
One way to solve the problem of a hot partition is to use an external cache, which would prevent your DB from being hit every time. This comes with added complexity that you may not want to introduce to your application, but it's an option.
You also have the option of distributing the data across partitions in DynamoDB, effectively implementing your own cache. For example, let's say you have a web application that has a list of "top 10 devices" directly on the homepage. You could create partitions DEVICES#1, DEVICES#2, DEVICES#3, ..., DEVICES#N that each store the top 10 devices. When your application needs to fetch the top 10 devices, it could randomly select one of these partitions to get the data. This may not work for a partition as large as Users, but it is a pretty neat pattern to consider.
Extending this idea further, you could partition Devices by some other meaningful metric (e.g. <manufactured_date> or <created_at>). This would distribute your Device items more uniformly throughout the database. Your application would be responsible for querying all the partitions and merging the results, but you'd reduce/eliminate the hot-partition problem. The AWS DynamoDB docs discuss this pattern in greater depth.
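A hedged sketch of reading the write-sharded layouts described above, again in PartiQL with invented table and key names. For the "top 10 devices" cache, the application would pick one shard number at random; for the date-based spread, it would issue one query per date bucket it cares about and merge the results itself:

    SELECT * FROM "AppData" WHERE PK = 'DEVICES#3'

    SELECT * FROM "AppData" WHERE PK = 'DEVICE#2020-01-15'

The first statement reads one random shard of the cached list; the second reads a single <created_at> bucket, and the application repeats it for each bucket in the range it needs.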
There's hardly a one-size-fits-all approach to DynamoDB data modeling, which can make it super tricky! Your specific access patterns will dictate which solution fits your scenario best.
Keeping in mind the best practices of having a single table and to evenly distribute items across partitions
Quickly highlighting the two things mentioned here.
Even distribution of partition keys is definitely a best practice.
Having the records in a single table is, in a general sense, about avoiding normalization as you would do in a relational database. In other words, it's fine to build with duplicate/redundant information. So it's not necessarily about cramming all possible data into a single table.
Now the problem is how can I query only a certain type of object? For example I want to retrieve all users, how do I do that?
Let's imagine that you had this table with only "user" data in it. Would that alone allow you to retrieve all users? Of course not - unless there is a single partition whose key is the type (user), with the rest, say, behind a sort key of the user id.
And creating multiple tables is a bad practice
I don't think it's considered bad to have more than one table. It's bad if we store data just like normalized tables and then have to use JOINs to get the data back together.
Having said that, what would be a better approach to follow?
The fundamental difference is to think about the queries first and derive the table design from them. That will even suggest whether DynamoDB is the right choice at all. For example, the requirement to select every user might altogether be a bad use case for DynamoDB to solve.
The query patterns will further suggest what the best partition key at hand is. Is DynamoDB the choice here because of high ingest and mostly immutable writes?
Do I always have the partition key in hand to perform the select that I need to perform?
What would the update statements look like - will they again have the partition key available to perform updates?
Do I need to further filter by additional columns and can that be the default sort order?
As you start answering some of these questions, a better model might appear altogether.

custom partition in clickhouse

I have several questions about custom partitioning in ClickHouse. Background: I am trying to build a TSDB on top of ClickHouse. We need to support very large batch writes and complicated OLAP reads.
Let's assume we use the standard partitioning by month, and we have 20 nodes in our ClickHouse cluster. I am wondering: will the data from the same month all flow to the same node, or will ClickHouse do some internal balancing and put the data from the same month on several nodes?
If all the data from the same month is written to the same node, that will be very bad for our scenario. I will probably consider partitioning by (timestamp, tags), where tags are the different tags that define the data source. Our monitoring system writes data to the TSDB every 30 seconds. Our read pattern is usually a single-table range scan or several tables joined on a column. Any advice on how I should customize my partitioning strategy?
Since ClickHouse does not support secondary indexes, and we will run selection queries on various columns, I think I should put those important columns into the primary key, so my primary key will probably look like (timestamp, ip, port, ...). Any advice on this design? Or can you give a good reason why ClickHouse does not support secondary indexes (like bitmap indexes) on other, non-primary columns?
In ClickHouse, partitioning and sharding are two independent mechanisms. Partitioning by month means that data from different months will never be merged into the same file on the filesystem; it has nothing to do with data placement between nodes (which is controlled by how exactly you set up your tables and run your INSERT INTO queries).
Partitioning by month or week usually works fine; for choosing the primary key, see the official documentation: https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/#selecting-the-primary-key
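To make that separation concrete, here is a hedged sketch (table, cluster, and column names are invented): the MergeTree definition controls partitioning and the primary key, while node placement only enters the picture through a separate Distributed table (or through which nodes you send your INSERTs to):

    -- Local table stored on each node: PARTITION BY groups data into parts on disk,
    -- ORDER BY defines the primary key / sort order used for range scans.
    CREATE TABLE metrics_local
    (
        ts    DateTime,
        ip    String,
        port  UInt16,
        tags  String,
        value Float64
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(ts)
    ORDER BY (ip, port, ts);

    -- Sharding across the 20 nodes is configured separately, e.g. with a Distributed table
    -- over a cluster defined in the server config; rand() spreads rows across shards.
    CREATE TABLE metrics AS metrics_local
    ENGINE = Distributed(my_cluster, default, metrics_local, rand());

With a setup like this, rows from the same month can live on any shard; within a shard they simply end up in the same monthly partition.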
As for secondary indexes: there are no fundamental issues with adding those; for example, bloom filter index development is in progress: https://github.com/yandex/ClickHouse/pull/4499

Best way to generate identifiers?

I will incrementally insert rows into a table. This table stores sales facts and has some columns that will be used to define an identifier: business id (int), product name (string), product price (float). E.g. <1, heineken, 1.0>, <1, heineken, 22.99>.
Certainly, these values will be used in joins. Thinking the SQL way, I would create a hashed column from those columns; this way I would be able to optimize some queries.
What about Data Lake and U-SQL? Should I calculate the hash on insert? Should I leave the values as they are? Should I simply concatenate them into one big string?
Thanks in advance.
While U-SQL supports clustering and distribution schemes on multiple columns, you could probably gain some additional performance in your joins if you find an efficient value to use for the equi-join comparison. So you could calculate a hash or concatenate the values.
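A hedged sketch of what that could look like in U-SQL (the rowset @sales and the column names are invented); since U-SQL expressions are C#, you could substitute a deterministic hash of the concatenated string if you prefer a fixed-width key:

    @withKey =
        SELECT businessId,
               productName,
               productPrice,
               // concatenated surrogate key used later for equi-joins
               businessId.ToString() + "|" + productName + "|" + productPrice.ToString("F2") AS joinKey
        FROM @sales;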
However, I think finding the right distribution scheme and clustering is the better "bang for your buck".
And, more importantly, please do not incrementally insert small numbers of rows; instead, bulk-insert many rows at a time (e.g., daily or weekly). And regularly rebuild the table or table partition to avoid table fragmentation, which would have a much bigger impact on your query performance.

MariaDB partitioning on the last 3 months

First, let me explain my problem:
I have a table that will contain approximately 5,000,000 records per year, and these records will be kept for at least 10 years (the exact retention is not yet defined). We are talking about production-machine events. I generate a report plus a dashboard displaying fairly complex information (average number of events per 10 minutes over a month, graphs, ...), and users also want to see the records themselves. The vast majority of the data displayed will be from the last 2 months; viewing the rest of the data must always be possible, but slower access is acceptable.
I work on MariaDB v10.1.12.
The idea was to create a partition covering the last 3 months. I realize now that this is not so easy. I have not found any way to define such a partition; in fact, it is impossible to base a partition on NOW(), CURRENT_DATE(), etc., either directly or indirectly via a computed column.
Do you have any ideas for me? Perhaps another solution besides partitioning.
Thank you in advance.
I recommend PARTITION BY RANGE(TO_DAYS(...)). If you are only now breaking the table into partitions, I would recommend annual partitions for data before this year, then quarterly or monthly partitions henceforth. Yes, that in theory leads to an infinite number of partitions, but I predict that you will revamp the data structure within a few years.
20-50 partitions is a good number. More than that leads to inefficiencies due to the multitude of partitions; fewer than that leads to asking "why bother".
Use InnoDB. Design the PRIMARY KEY carefully, since it may be useful as the primary index into the data.
Usually it is best to put the date/timestamp column last in any indexes. Putting it first would be redundant since partition pruning comes first.
More on partitioning.
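A hedged sketch of such a layout (table and column names are invented, and the ranges are just examples - extend them to your actual data and add new monthly partitions as time goes on):

    CREATE TABLE machine_events (
        event_time DATETIME NOT NULL,
        machine_id INT UNSIGNED NOT NULL,
        event_type VARCHAR(40) NOT NULL,
        PRIMARY KEY (machine_id, event_time)      -- timestamp last, as suggested above
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(event_time)) (
        PARTITION p_old     VALUES LESS THAN (TO_DAYS('2016-01-01')),  -- annual bucket(s) for older data
        PARTITION p2016_01  VALUES LESS THAN (TO_DAYS('2016-02-01')),  -- monthly from here on
        PARTITION p2016_02  VALUES LESS THAN (TO_DAYS('2016-03-01')),
        PARTITION p_future  VALUES LESS THAN MAXVALUE                  -- reorganize/split this as new months arrive
    );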
It sounds like a main purpose of the table is to summarize the data for graphing, etc. In that case, it may be very beneficial to build and maintain "summary table(s)" of counts and subtotals over selected time intervals. Do roughly 100 rows get added per 10-minute interval? If so, a summary table based on 10-minute intervals would have about 1/100th as many rows, and queries against it would be much faster. Plus, you could 'denormalize' the summary tables to make them even simpler.
More on Summary tables.
It might be worth it to gather data for 10 minutes into a staging table, then summarize it into the summary table. And also throw the raw data into the big table.
Or, if the summary tables have everything you need, you could abandon the big table. Or, as a compromise, keep 12 months' worth of data (partitioned by month) and DROP PARTITION for older data. Meanwhile, the summary tables can continue to grow (although they will be much smaller).
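A hedged sketch of the summary-table idea (names invented, reusing the hypothetical machine_events table from above): roll the raw rows up into 10-minute buckets, run the aggregation once per interval after it closes, and point the dashboard at the small table.

    CREATE TABLE machine_events_10min (
        bucket_start DATETIME NOT NULL,               -- start of the 10-minute interval
        machine_id   INT UNSIGNED NOT NULL,
        event_count  INT UNSIGNED NOT NULL,
        PRIMARY KEY (machine_id, bucket_start)
    ) ENGINE=InnoDB;

    -- Example load for one closed interval; the application supplies the interval bounds.
    INSERT INTO machine_events_10min (bucket_start, machine_id, event_count)
    SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(event_time) / 600) * 600) AS bucket_start,
           machine_id,
           COUNT(*)
    FROM machine_events
    WHERE event_time >= '2016-03-01 10:00:00'
      AND event_time <  '2016-03-01 10:10:00'
    GROUP BY bucket_start, machine_id;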
Table partitioning is an advanced feature; it is not indexing but a rearrangement of the table's data. So it does not "duplicate" anything - new data is simply stored according to the predefined partitioning ranges.
You must still specify the date-range criteria in your queries as usual, and you MUST create indexes on columns that are not used in the partitioning range. When you run a SELECT, the algorithm associated with the partitioned table handles any merging (if required) in the background, so you can treat a partitioned table exactly like a typical table.
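For example (hedged, reusing the invented machine_events table and an invented 'ALARM' value): the constant date range lets the optimizer prune to the relevant partitions, while filtering on a column outside the partitioning expression still needs an ordinary index:

    -- Index for a column that is not part of the partitioning expression (timestamp last, per the advice above).
    ALTER TABLE machine_events ADD INDEX idx_type_time (event_type, event_time);

    -- The constant date range allows partition pruning; the application computes "2 months ago" itself.
    SELECT machine_id, event_type, event_time
    FROM machine_events
    WHERE event_time >= '2016-01-01'
      AND event_type = 'ALARM';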
For more details, please check the MariaDB partitioning overview

(CHORD) Peer-2-Peer How does it work/What does it do?

https://en.wikipedia.org/wiki/Chord_(peer-to-peer)
I've looked into Chord and I'm having trouble understanding exactly what it does.
Is it a protocol for a distributed hash table that stores various keys/values for later use? Is it just an efficient way to look up, in the hash table, the value for a given key?
Any help, such as a basic example, would be much appreciated.
An example question: say I insert the string "Hi" and it hashes to 3, but there is no peer at 3 - would it go to the next available peer and be stored there? Or where does it store its values?
I already answered a similar question for BitTorrent/Kademlia, so just to summarize in a more general sense:
DHTs store the values with some redundancy on the N nodes whose IDs are closest to the target hash.
Considering the vastness of >= 128-bit keyspaces, it is extremely unlikely for a node ID to exactly match the key - at least in routing schemes where nodes don't adjust their IDs based on content, and Chord is one of those. So yes, in Chord's case the key ends up on its successor, i.e. the next node ID encountered going around the ring, which matches your "next available peer" intuition.
It's pretty much the same as a regular hash table - hence distributed hash table. You have a limited set of buckets into which the entries are hashed, where the bucket space is much smaller than the potential input keyspace and thus does not precisely match the keys either.
