Teradata PARTITION BY to Snowflake CLUSTER BY

In Teradata we specify PARTITION BY on huge tables to make data retrieval faster and more effective.
Snowflake does micro-partitioning itself, but we can still specify CLUSTER BY, right?
Does CLUSTER BY support multi-level partitioning?
How would the following look in Snowflake?
PARTITION BY (
    RANGE_N(trans_dt BETWEEN DATE '2012-12-01' AND DATE '2025-12-31' EACH INTERVAL '1' MONTH),
    CASE_N(rec_ind = 'Y',
           rec_ind = 'N',
           rec_ind NOT IN ('Y','N'))
);
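For reference, a possible Snowflake counterpart (a sketch, not a definitive translation): Snowflake has no declarative RANGE_N/CASE_N partitions, since micro-partitioning happens automatically, but a clustering key can contain multiple columns or expressions, which covers the multi-level case. The table and column definitions below are assumptions reconstructed from the question:

```sql
-- Micro-partitioning is automatic; CLUSTER BY only influences how rows
-- are co-located within micro-partitions. A multi-expression clustering
-- key approximates the two-level Teradata scheme above.
CREATE TABLE my_table (
    trans_dt DATE,
    rec_ind  VARCHAR(1)
)
CLUSTER BY (DATE_TRUNC('MONTH', trans_dt), rec_ind);
```

Whether `rec_ind` is worth including depends on its cardinality and how often queries filter on it; Snowflake's guidance is to keep clustering keys to a few low-to-moderate-cardinality expressions.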

Related

Creating partitions locks the table?

I have a couple of big tables (from 60M rows to 2 billion rows) on which I need to create partitions. As they are used in the core of our platform, we are trying to find out whether, when we start creating the partitions in the production environment, the database will take any kind of lock on those tables.
We are on MariaDB 10.0.24 and 10.1.34.
ALTER TABLE data_history PARTITION BY RANGE (period) (
    PARTITION past   VALUES LESS THAN (201801),
    PARTITION p2018  VALUES LESS THAN (201901),
    PARTITION p2019  VALUES LESS THAN (202001),
    PARTITION p2020  VALUES LESS THAN (202101),
    PARTITION future VALUES LESS THAN MAXVALUE
);
The period column is an integer that follows the year+month (YYYYMM) format.
When adding PARTITIONing to a table, the entire table is copied over. Hence the lock.
When adding a PARTITION to an already-partitioned table, it depends on the actual command issued. All (?) cases will LOCK, but the length of the lock varies. Please provide the actual command.
Why partition? Partitioning does not intrinsically lead to any performance improvements. See this for discussion of the few cases where partitioning is beneficial: http://mysql.rjweb.org/doc.php/partitionmaint
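As an illustration of the second case, adding a new yearly partition by splitting the catch-all `future` partition only rebuilds that one partition rather than copying the whole table (a sketch against the `data_history` table above; the new partition name is an assumption):

```sql
-- Only rows currently in `future` are copied, so the lock is much
-- shorter than a full PARTITION BY rebuild of the table.
ALTER TABLE data_history REORGANIZE PARTITION future INTO (
    PARTITION p2021 VALUES LESS THAN (202201),
    PARTITION future VALUES LESS THAN MAXVALUE
);
```

Keeping `future` nearly empty (by splitting it before rows accumulate there) keeps this operation cheap.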

custom partition in clickhouse

I have several questions about custom partitioning in ClickHouse. Background: I am trying to build a TSDB on top of ClickHouse. We need to support very large batch writes and complicated OLAP reads.
Let's assume we use the standard partition by month, and we have 20 nodes in our ClickHouse cluster. I am wondering: will all the data from the same month flow to the same node, or will ClickHouse do some internal balancing and spread data from the same month across several nodes?
If all the data from the same month is written to the same node, that will be very bad for our scenario. I will probably consider partitioning by (timestamp, tags), where tags are the different tags that identify the data source. Our monitoring system writes data to the TSDB every 30 seconds. Our read pattern is usually a single-table range scan or a join of several tables on a column. Any advice on how I should customize my partitioning strategy?
Since ClickHouse does not support secondary indexes, and we will run selection queries on columns, I think I should put the important columns into the primary key, so my primary key will probably look like (timestamp, ip, port...). Any advice on this design, or a good reason why ClickHouse does not support secondary indexes (like bitmap indexes) on non-primary columns?
In ClickHouse, partitioning and sharding are two independent mechanisms. Partitioning by month means that data from different months will never be merged into the same file on the filesystem; it has nothing to do with data placement between nodes, which is controlled by how you set up your tables and run your INSERT INTO queries.
Partitioning by month or by week usually works fine; for choosing a primary key, see the official documentation: https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/#selecting-the-primary-key
There are no fundamental issues with adding those; for example, bloom filter index development is in progress: https://github.com/yandex/ClickHouse/pull/4499
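A minimal MergeTree sketch along these lines (table and column names are assumptions, not from the question):

```sql
CREATE TABLE metrics
(
    ts    DateTime,
    ip    String,
    port  UInt16,
    tags  String,
    value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(ts)   -- pruning/retention unit; independent of sharding
ORDER BY (ip, port, ts);    -- primary key: filter columns first, time last
```

Note that the partition key stays coarse (months), while the selectivity the question is after comes from the ORDER BY key; very fine-grained partition keys such as (timestamp, tags) tend to create too many parts.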

How to dump a Cassandra DB into a triplestore?

I have a Cassandra DB with a table containing values loaded from sensors, composed of 5 columns (user, sensor, observed param, value, timestamp).
I would like to represent them in a triplestore in order to run SPARQL queries and extract knowledge.
Initially I thought of Titan because it is based on Cassandra, but I can't find a way to automatically dump the Cassandra table into a Titan graph.
Is there a way to do it with Titan or another triplestore?

How can I check if there's data in an Oracle partition for a given day?

How can I "ping" a multi-day Oracle partition and check whether it contains data for a given day?
Ideally something that runs fast.
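One common, fast approach is to probe the partition directly and stop at the first matching row (a sketch; the `sales` table, `trans_dt` column, and partition name are hypothetical):

```sql
-- Touches only the named partition and stops as soon as one row is found.
SELECT 1
FROM sales PARTITION (p_2024_01)
WHERE trans_dt = DATE '2024-01-15'
  AND ROWNUM = 1;
```

If the partition key is the date column itself, the explicit PARTITION clause is unnecessary: a plain WHERE on the key lets the optimizer prune to the right partition automatically.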

Collect statistics for a single partition in Teradata

I have a table that has its primary index on one column and is partitioned by a date column. This is a sample of the DDL:
CREATE MULTISET TABLE DB.TABLE_NAME,
    NO FALLBACK,
    NO BEFORE JOURNAL,
    NO AFTER JOURNAL,
    CHECKSUM = DEFAULT,
    DEFAULT MERGEBLOCKRATIO
(
    FIRST_KEY DECIMAL(20,0) NOT NULL,
    SECOND_KEY DECIMAL(20,0),
    THIRD_COLUMN VARCHAR(5),
    DAY_DT DATE FORMAT 'YYYY-MM-DD'
)
PRIMARY INDEX TABLE_NAME_IDX_PR (FIRST_KEY)
PARTITION BY RANGE_N(DAY_DT BETWEEN DATE '2007-01-06'
    AND DATE '2016-01-02' EACH INTERVAL '1' DAY);
COLLECT STATS ON DB.TABLE_NAME COLUMN(FIRST_KEY);
The incoming data can be around 30 million rows each day, and I have loaded the data for 2012-04-11. Now I have to collect stats for only the '2012-04-11' partition instead of the whole table.
Is there any way to collect statistics for a particular day's partition?
You can simply collect stats on the system column PARTITION and it should update the histograms relating to the partitioned column.
COLLECT STATS ON {databasename}.{tablename} COLUMN (PARTITION);
This can be collected on both partitioned and non-partitioned tables. It gives the optimizer the cardinality of the table and of the partitions (if they exist), and it updates the statistics for all the partitions of the table. Collecting stats on the PARTITION column is a low-CPU-cost, short wall-clock process; it is significantly less expensive than collecting stats on a physical column or the entire table, even for tables with tens of millions of records or more.
If you want to determine whether the optimizer recognizes the refreshed statistics, there is no direct way as of TD 13.10 (not sure about TD 14.x). However, if you run an EXPLAIN on your query, you can tell whether the optimizer has high confidence on the step in which the criteria against the partitioned column appear. If you specify a single date, such as DATE '2012-04-11', you should see in the EXPLAIN that partition elimination has taken place on a single partition.
If you need help with digesting the EXPLAIN, edit your original question with the EXPLAIN plan for the query and I will help you digest it.
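For example, to confirm single-partition elimination after refreshing the PARTITION stats (a sketch against the sample table above):

```sql
-- The plan should mention a single partition of DB.TABLE_NAME and,
-- with fresh stats, "high confidence" on the retrieval step.
EXPLAIN
SELECT COUNT(*)
FROM DB.TABLE_NAME
WHERE DAY_DT = DATE '2012-04-11';
```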
