Using varchar columns as partitions in Teradata to speed up truncates

I have a Teradata table with two columns, company_name varchar(500) and case_name varchar(500).
The number of distinct values in these two columns is limited to the hundreds, but it is not controlled by me, and I can't predefine the set of values.
Truncating all data for a specified company_name and case_name is a daily operation, so I want to use these two columns as partitions.
Is this supported? And will it help to delete data by partition in Teradata? If it is not supported, is there a best practice for truncating data by two varchar columns?

When access to those columns is (mainly) based on WHERE company_name = 'foo' AND case_name = 'bar', you can partition by a calculation like this:
PRIMARY INDEX ( PIcol)
PARTITION BY
Range_N(HashBucket(HashRow(company_name,case_name)) MOD 65533 BETWEEN 0 AND 65532 EACH 1)
A DELETE with WHERE company_name = 'foo' AND case_name = 'bar' will access a single partition, but it's not a FastPath delete; it will be transient journaled.
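For illustration, here is a minimal sketch of a complete table definition using this approach; the table name mydb.cases, the primary index column PIcol, and the remaining column definitions are assumptions, since the original DDL isn't shown:
CREATE MULTISET TABLE mydb.cases (
    PIcol BIGINT,                 -- assumed primary index column
    company_name VARCHAR(500),
    case_name VARCHAR(500)
)
PRIMARY INDEX ( PIcol )
PARTITION BY
Range_N(HashBucket(HashRow(company_name,case_name)) MOD 65533 BETWEEN 0 AND 65532 EACH 1);

-- The daily cleanup then touches a single partition:
DELETE FROM mydb.cases
WHERE company_name = 'foo'
  AND case_name = 'bar';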

Related

Querying a composite index in DynamoDB

I have a DynamoDB table with following keys:
id: partition key
created_at: sort key
brand#category#size#color: partition key for global index 'byAttributes'
The global index partition key is a composite of 4 table attributes using '#' as a delimiter.
Is there a way in DynamoDB that I can query the table using only a subset of the attributes using a wildcard for unspecified attributes?
As examples:
byAttributes = 'levis#shirts#*#red'
byAttributes = '*#pants#L#*'
I don't wish to use a FilterExpression because it only filters data after a search. I want to take advantage of the attributes being indexed.
No. But you can create alternative GSIs for different combinations.
You can also include a hierarchical SK value and use begins_with to limit based on zero or more values.
Putting some values in the PK and the rest in a hierarchical SK achieves a lot of combinations.
For example, have a GSI:
PK = category,
SK = size#brand#color
Now you can query by category, category/size, category/size/brand, or all four.
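As a hedged sketch using DynamoDB's PartiQL support (the table name Products and index name byCategory are assumptions for illustration), those prefix queries could look like:
-- category only:
SELECT * FROM "Products"."byCategory" WHERE PK = 'pants';
-- category + size:
SELECT * FROM "Products"."byCategory" WHERE PK = 'pants' AND begins_with(SK, 'L#');
-- category + size + brand:
SELECT * FROM "Products"."byCategory" WHERE PK = 'pants' AND begins_with(SK, 'L#levis#');
The same conditions can be expressed with KeyConditionExpression in the Query API.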
If it gets more than four you may want to look at Rockset as an indexing system against DynamoDB data.

Using millisecond timestamp as the global secondary index in DynamoDB for range queries?

We have a Dynamodb table Events with about 50 million records that look like this:
{
"id": "1yp3Or0KrPUBIC",
"event_time": 1632934672534,
"attr1" : 1,
"attr2" : 2,
"attr3" : 3,
...
"attrN" : N,
}
The partition key is id and there is no sort key. There can be a variable number of attributes other than id (globally unique) and event_time, which are required.
This setup works fine for fetching by id, but now we'd like to efficiently query against event_time and pull ALL attributes for records that match within a range (could be a million or two items). The criteria would be something like WHERE event_time BETWEEN 1632934671000 AND 1632934672000, for example.
Without changing any existing data or transforming it through an external process, is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query? By my understanding of DynamoDB this isn't possible, but maybe there's another configuration I'm overlooking.
Thanks in advance.
(Edit: I rewrote the answer because the OP's comment clarified that the requirement is to query event_time ranges ignoring id. OP knows the table design is not ideal and is trying to make the best of a bad situation).
Is it possible to create a Global Secondary Index using event_time and projecting ALL attributes that could allow a range query?
Yes. You can add a Global Secondary Index to an existing table and choose which attributes to project. You cannot add an LSI to an existing table or change the table's primary key.
Without changing any existing data or transforming it through an external process?
No. You will need to manipulate the attributes. Although arbitrary range queries are not its strength, DynamoDB has a time series pattern that can be adapted to your query pattern.
Let's say you query mostly by a limited number of days. You would add a GSI with a yyyy-mm-dd PK (Partition Key). Rows are made unique by an SK (Sort Key) that concatenates the timestamp with the id: event_time#id. PK and SK together form the index's Composite Primary Key.
GSI1PK = yyyy-mm-dd # 2022-01-20
GSI1SK = event_time#id # 1642709874551#1yp3Or0KrPUBIC
Querying a single day needs 1 query operation; a calendar-week range needs 7.
GSI1PK = "2022-01-20" AND GSI1SK > ""
Query a range within a day by adding an SK BETWEEN condition:
GSI1PK = "2022-01-20" AND GSI1SK BETWEEN "1642709874" AND "16427098745"
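As a sketch in PartiQL (assuming the table is named Events and the index is named GSI1; both names are assumptions), that within-day range query could be written as:
SELECT * FROM "Events"."GSI1"
WHERE GSI1PK = '2022-01-20'
  AND GSI1SK BETWEEN '1642709874' AND '16427098745';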
It seems like one can create a global secondary index at any point.
Below is an excerpt from the Managing Global Secondary Indexes documentation, which can be found at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.OnlineOps.html:
To add a global secondary index to an existing table, use the UpdateTable operation with the GlobalSecondaryIndexUpdates parameter.
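A hedged sketch of that operation via the AWS CLI, reusing the Events table and the GSI1PK/GSI1SK attributes from the answer above (for a provisioned-capacity table, the Create block would also need a ProvisionedThroughput entry):
aws dynamodb update-table \
    --table-name Events \
    --attribute-definitions \
        AttributeName=GSI1PK,AttributeType=S \
        AttributeName=GSI1SK,AttributeType=S \
    --global-secondary-index-updates \
        '[{"Create": {
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"}
            ],
            "Projection": {"ProjectionType": "ALL"}
        }}]'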

Teradata: get columns/fields used in join and where conditions, and the respective table, without parsing the query

I am trying to automate some performance checks on queries in Teradata.
As part of that, I want to check whether the columns used in a join condition are the primary index of the respective table, and similarly whether the columns used in a where condition are partitioning columns of the respective table. Is there any direct Teradata query which can give this without parsing the whole query?
Yes, there are two dbc views you can query:
dbc.columnsv
dbc.indicesv
Primary index information is stored in the second view; just search by your table name and database name.
Partitioning information is stored in columnsv; there is a column with the flag value 'Y' for partitioning columns.
Example:
SELECT DatabaseName, TableName, ColumnName FROM dbc.ColumnsV WHERE PartitioningColumn = 'Y' AND TableName = <> AND DatabaseName = <>;
SELECT * FROM dbc.IndicesV WHERE TableName = <> AND DatabaseName = <>;
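As a further sketch, the index lookup can be narrowed to just primary index columns via the IndexType codes, assuming the usual dbc codes 'P' (nonpartitioned primary) and 'Q' (partitioned primary) apply on your release:
SELECT DatabaseName, TableName, ColumnName, ColumnPosition
FROM dbc.IndicesV
WHERE IndexType IN ('P', 'Q')
  AND TableName = <>
  AND DatabaseName = <>
ORDER BY ColumnPosition;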

Date difference between two separate rows in SQLite with no ID

I have data in SQLite like this (a few thousands of rows):
1536074432|startRecording
1536074434|stopRecording
1536074443|startRecording
1536074447|stopRecording
1536074458|startRecording
1536074462|stopRecording
And I'd like to get the amounts of seconds passed between two consecutive distinct events (basically how many seconds of video I've recorded).
I know about another similar question (Date Difference between consecutive rows), but in my case it's different because I cannot get the "next" row by ID; I have to get it based on a different event name.
There is an answer that works magic, but it's specific to SQL Server ( Query to find the time difference between successive events ), and I need this for SQLite.
I could do this in Oracle with the LAG / LEAD functions, but no idea how to do it in SQLite.
I could also do this with a separate parsing script, but I think it would be more efficient to be able to do this directly from a query.
Even though there is no id in the table, SQLite stores a rowid (from the SQLite CREATE TABLE documentation):
ROWIDs and the INTEGER PRIMARY KEY
Except for WITHOUT ROWID tables, all rows within SQLite tables have a 64-bit signed integer key that uniquely identifies the row within its table. This integer is usually called the "rowid". The rowid value can be accessed using one of the special case-independent names "rowid", "oid", or "_rowid_" in place of a column name. If a table contains a user defined column named "rowid", "oid" or "_rowid_", then that name always refers to the explicitly declared column and cannot be used to retrieve the integer rowid value.
Assuming perfectly clean data as described :) how about:
SELECT a.rowid, a.time, a.event, b.rowid, b.time, b.event,
       b.time - a.time AS elapsed --, SUM(b.time - a.time)
FROM t2 a, t2 b
WHERE a.rowid % 2 = 1
  AND b.rowid = a.rowid + 1;
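Since version 3.25, SQLite also supports window functions, so the LAG approach mentioned in the question works directly. A sketch, assuming the same t2 table with time and event columns and strictly alternating start/stop rows:
SELECT time, event, elapsed
FROM (
    SELECT time, event,
           time - LAG(time) OVER (ORDER BY rowid) AS elapsed
    FROM t2
)
WHERE event = 'stopRecording';
Wrapping this in SELECT SUM(elapsed) FROM (...) gives the total seconds recorded.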

Teradata: How to extend the range partition of a non-empty partitioned table?

I created a table, mydb.mytable, with essentially the following SQL, say last week:
CREATE MULTISET TABLE mydb.mytable ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
master_transaction_header VARCHAR(64) CHARACTER SET LATIN NOT CASESPECIFIC,
demand_date DATE FORMAT 'YY/MM/DD',
item_id BIGINT,
QTY INTEGER,
price DECIMAL(15,2))
PRIMARY INDEX ( master_transaction_header )
PARTITION BY RANGE_N(demand_date BETWEEN DATE '2018-01-01' AND CURRENT_DATE EACH INTERVAL '1' DAY );
When I try to insert data into it, for say yesterday, Teradata gives me the following error message:
Partitioning violation for table mydb.mytable
When I try to extend the partition using:
ALTER TABLE mydb.mytable MODIFY PRIMARY INDEX (master_transaction_header) ADD RANGE BETWEEN DATE '2018-03-15' AND CURRENT_DATE EACH INTERVAL '1' DAY;
I get the following error message from Teradata:
The altering of RANGE_N definition with CURRENT_DATE/CURRENT_TIMESTAMP is not allowed.
I understand that I could:
Create a copy with PARTITION BY RANGE_N(demand_date BETWEEN DATE '2018-01-01' AND '9999-12-31' EACH INTERVAL '1' DAY );
Insert all the data from the old table into the new one
drop the old table
rename the new table
but I am hoping that Teradata provides a more elegant way to add partitions to an existing partitioned table.
I have already consulted the following stackoverflow posts:
Range partition table creation with large number of partitions
Teradata: How to add range partition to non empty table?
They were enlightening, but I could not conjure an answer from the discussion therein.
Using CURRENT_DATE for partitioning is possible, but I never found a use case for it.
When you create the table, CURRENT_DATE is resolved to that day's date but not changed afterwards; check the ResolvedCurrent_Date column in dbc.PartitioningConstraintsV. When you submit an ALTER TABLE mydb.mytable TO CURRENT, it is resolved again and the range is modified.
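A minimal sketch of that check and the re-resolution, using the table from the question:
-- See which date the partitioning expression currently resolves to:
SELECT DatabaseName, TableName, ResolvedCurrent_Date
FROM dbc.PartitioningConstraintsV
WHERE DatabaseName = 'mydb'
  AND TableName = 'mytable';

-- Re-resolve CURRENT_DATE and extend the range accordingly:
ALTER TABLE mydb.mytable TO CURRENT;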
But there's no reason to do this; simply define the range large enough that you never have to modify it again, e.g.
PARTITION BY RANGE_N(demand_date
BETWEEN DATE '2018-01-01'
AND DATE '2040-01-01' EACH INTERVAL '1' DAY);
Unused partitions have zero overhead in Teradata.
