Question about a unique primary key vs unique primary index - teradata

I searched many different topics here on Stack Overflow related to this question, but I can't seem to find an answer that makes sense to me. I come from an MS SQL Server background, where just about every table has a primary key and many also have foreign keys. In Teradata, I understand there are primary keys and also unique primary indexes. It is my understanding that values for a unique primary index in Teradata can't be null. If that is the case, why use primary keys at all? Is it just to enforce parent-child relationships in the table structures, as in other RDBMSs, such as a sales order header table (PK) and a sales order detail table (FK)? Maybe a good use for a UPI without a PK would be a table that has no relationships to other tables, like a "reporting" table?
Just want to make sure I understand correctly. Thank you for the help.

Related

DynamoDB - GSI versus duplication

I have a question about many-to-many relationships within DynamoDB and what happens on a GSI versus shallow duplication.
Say I want to model the standard many-to-many relationship in social media: a user can follow many pages, and a page has many followers. So the access patterns are that you need to pull all the followers for a page, and you need to see all the pages that a user follows.
If you create an item that has a primary key of the ID of the page and a sort key of the user ID, this lets you pull all followers for that page.
You could then place a GSI on that item with an inverted index. This would let you query all pages a user is following.
What exactly is happening there? Is DynamoDB duplicating that data somewhere with the keys rearranged? Is this any different than just creating a second item in the table with a primary key of the user and the sort key of the page?
So, you have this item:
Item 1:
PK SK
FOLLOWEDPAGE#<PageID> USER#<UserId>
And you can create a GSI and invert SK and PK, or you could simply create this second item:
Item 2:
FOLLOWINGUSER#<UserId> PAGE#<PageID>
Other than the fact that you now have to maintain this second item, how is this functionally different?
Does a GSI duplicate items with that index?
Does it duplicate items without that index?
Is DynamoDB duplicating that data somewhere with the keys rearranged?
Yes, a secondary index is an opaque copy of your data. As the docs say: A secondary index is a data structure that contains a subset of attributes from a table, along with an alternate key to support Query operations. You choose what data gets copied (DynamoDB speak: projected) to the index.
Is this any different than just creating a second item in the table with a primary key of the user and the sort key of the page?
Apart from the maintenance burden you mention, conceptually they are similar. There are some technical differences between a Global Secondary Index and DIY replication:
A GSI has its own provisioned throughput settings, separate from the table's, although the read and write units consumed and storage costs incurred are the same for both approaches.
Reads from a GSI are always eventually consistent.
A Scan operation will be ~2x worse with the DIY approach, because the table is ~2x bigger.
See the Best practices for using secondary indexes in DynamoDB for optimization patterns.
Does a GSI duplicate items with that index?
Yes.
Does it duplicate items without that index?
No.
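To make the "copy with the keys rearranged" idea concrete, here is a minimal in-memory sketch. It is only a toy model, not how DynamoDB is implemented: the page and user IDs and the helper names `build_inverted_index` and `pages_followed_by` are made up for illustration.

```python
from collections import defaultdict

# Base table: one "follow" item per (page, user) pair, keyed as in the question.
base_items = [
    {"PK": "FOLLOWEDPAGE#p1", "SK": "USER#u1"},
    {"PK": "FOLLOWEDPAGE#p1", "SK": "USER#u2"},
    {"PK": "FOLLOWEDPAGE#p2", "SK": "USER#u1"},
]


def build_inverted_index(items):
    """Toy model of a GSI keyed on the table's SK, sorted by the table's PK."""
    index = defaultdict(list)
    for item in items:
        # Each index entry is a copy (a "projection") of the base item.
        index[item["SK"]].append(dict(item))
    for copies in index.values():
        copies.sort(key=lambda it: it["PK"])
    return index


def pages_followed_by(index, user_id):
    """Query the inverted index by user: returns the pages that user follows."""
    return [it["PK"] for it in index.get(f"USER#{user_id}", [])]


gsi = build_inverted_index(base_items)
followed = pages_followed_by(gsi, "u1")
```

In this model the index holds a second copy of every item that has the index key, which is the same data you would otherwise write yourself as "Item 2"; the difference is who keeps the copy up to date.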

How to sort and query DynamoDB by non unique values? I.E. Names

Let's say I make a GSI for 'Name' and I have two people in my database who just happen to have the same name:
Tim Cook
Tim Cook
Now this will fail a consistency constraint on insert for duplicate values hence we need another approach.
I was thinking about hashing the name values at the end so that the BEGINS_WITH operator can still be used to search / match on but that puts you in a weird position. What do you salt with? How many characters? The longer the salt the more memory and potentially compute you waste cleaning up the salt before returning the results to the user. The shorter the salt the more likely you are to have collisions. After all there are some incredibly common names out there.
Here's an example of the values salted:
Tim Cook#ABCDEF
Tim Cook#ZYXWVU
This is great as I can insert both values now and now I can create a 'search user by name' endpoint for the user via the BEGINS_WITH('Tim Cook') operation but it feels weird.
I did a bit of searching though on sorting and searching by names in DynamoDB and didn't come up with anything meaningful on how to proceed from here. Wondering what you guys think.
My one and final issue is that names are not evenly spread out so you're inevitably going to have hotter partitions but I just don't see another way around this. Minus of course exfiltrating the data to another data store and querying it there like a full text search store.
You can’t insert to a GSI. So your concern is kind of misplaced.
You also can’t Get Item on a GSI, only Query, and that’s because there’s not necessarily one matching value for a given key.
Note: The GSI always projects the primary key over from the base table.
You can follow the following schema pattern to achieve your goal:
Partition key: Name
Sort/Range key: createdAt (The creation time of that row)
In this case, if more than one person has the same name, a Query on that name returns all of them, automatically sorted by createdAt. This schema also gives each item in your table its own composite key.
Partition key -> Sort key
Name -> createdAt
Tim Cook -> "HH:mm:ss"
Each row will have a different creation time and will provide unique composite key values for each item of the table.
For some reason I thought GSIs had the same uniqueness constraint as partition keys. However, that's not the case: you can have duplicates.
In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.
Source
So a GSI is a perfectly good way to store duplicated information. Not sure this question is helpful now since it came about through ignorance so it might be worth deleting now.
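A small in-memory sketch of that difference, under the assumption of a table keyed on a hypothetical `user_id` attribute with a name index on top. `TinyTable` and its methods are invented for illustration and are not a DynamoDB API.

```python
from collections import defaultdict


class TinyTable:
    """Toy table: unique key on user_id, plus a non-unique index on name."""

    def __init__(self):
        self.items = {}                      # user_id -> item (key must be unique)
        self.name_index = defaultdict(set)   # name -> set of user_ids (duplicates fine)

    def put(self, item):
        # Writing an existing key replaces the old item, so drop its index entry first.
        old = self.items.get(item["user_id"])
        if old is not None:
            self.name_index[old["name"]].discard(old["user_id"])
        self.items[item["user_id"]] = item
        self.name_index[item["name"]].add(item["user_id"])

    def query_by_name(self, name):
        # An index lookup is a query: zero, one, or many items can match.
        return [self.items[uid] for uid in sorted(self.name_index.get(name, ()))]


table = TinyTable()
table.put({"user_id": "u1", "name": "Tim Cook"})
table.put({"user_id": "u2", "name": "Tim Cook"})
matches = table.query_by_name("Tim Cook")
```

Both items are stored because their table keys differ, and the name lookup returns both, so no salting of the name is needed.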

DynamoDB - how to deal with non-unique timestamps as sort key?

I have data that has timestamps that I would like to index as a range key so that I can query on time.
The issue is that the timestamp may not be unique across the partition.
For example:
PK SK
account 2021-08-06T12:40:32Z
account 2021-08-06T12:48:37Z
account 2021-08-06T12:48:37Z
This won't work, because the last two items have the same key. If I make the PK something unique, like this:
PK SK
12345 2021-08-06T12:40:32Z
12346 2021-08-06T12:48:37Z
12347 2021-08-06T12:48:37Z
Then I can't query across all my data on timestamp because each record is in a different partition.
How would you go about querying time in DynamoDB? Previous examples I've seen use the SK, but this only works if the timestamp is unique.
Scan really isn't an option.
Primary keys need to uniquely identify an item in your base table, but GSIs do not have the same requirement.
If you have a requirement for a unique ID and time sorting, you might want to take a look at KSUIDs (or ULIDs).
A KSUID, or K-Sortable Unique Identifier, is a unique identifier with time-based ordering. This lets you have unique identifiers that are sortable by creation time (or another time if needed). You can read a Brief history of the UUID for more details.
KSUIDs are great when you have a need for unique ID's and time sorting. I've found it especially useful in DynamoDB where I often have the need to sort by creation date.
There are KSUID libraries in several programming languages, so you don't need to implement the algorithm yourself. There's even a KSUID generator website that you can use to quickly interact with KSUIDs.
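To illustrate the idea only, here is a simplified time-sortable ID. It is not the real KSUID or ULID encoding; the layout (8-byte millisecond timestamp followed by 10 random bytes, hex-encoded) and the name `sortable_id` are assumptions made for this sketch. In practice you would use one of the existing libraries.

```python
import os
import time


def sortable_id(timestamp=None):
    """Return a 36-char hex ID: 8-byte big-endian ms timestamp + 10 random bytes.

    Because the timestamp prefix has a fixed width, plain string comparison
    orders IDs by time; the random tail keeps IDs from the same instant distinct.
    """
    seconds = time.time() if timestamp is None else timestamp
    millis = int(seconds * 1000)
    return millis.to_bytes(8, "big").hex() + os.urandom(10).hex()


earlier = sortable_id(timestamp=1628253632)  # 2021-08-06T12:40:32Z
later_a = sortable_id(timestamp=1628254117)  # 2021-08-06T12:48:37Z
later_b = sortable_id(timestamp=1628254117)  # same second, different random tail
```

Used as a sort key, such an ID stays unique even when two items share a timestamp, while a range query on the prefix still works.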
So it seems that partition and sort keys in a GSI do not need to be unique.
If I create a table with just a PK based on an individual ID and then create a GSI with the PK on account and the SK on date, I can query the GSI to get the desired result.
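A toy in-memory version of that layout, using the sample rows from the question. The attribute names `id`, `account`, and `ts` and the helpers `build_index` and `query_between` are hypothetical; the point is only that the index groups by account, sorts by timestamp, and tolerates duplicate timestamps.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Base table items: unique "id" is the table key; account and ts are plain attributes.
items = [
    {"id": "12345", "account": "account", "ts": "2021-08-06T12:40:32Z"},
    {"id": "12346", "account": "account", "ts": "2021-08-06T12:48:37Z"},
    {"id": "12347", "account": "account", "ts": "2021-08-06T12:48:37Z"},
]


def build_index(items):
    """Group by account and keep (ts, id) pairs sorted, like a GSI on account/ts."""
    index = defaultdict(list)
    for item in items:
        index[item["account"]].append((item["ts"], item["id"]))
    for entries in index.values():
        entries.sort()
    return index


def query_between(index, account, start, end):
    """Return ids for one account whose ts lies in [start, end] (inclusive)."""
    entries = index.get(account, [])
    lo = bisect_left(entries, (start, ""))
    hi = bisect_right(entries, (end, "\uffff"))
    return [item_id for _, item_id in entries[lo:hi]]


index = build_index(items)
result = query_between(index, "account", "2021-08-06T12:45:00Z", "2021-08-06T12:50:00Z")
```

This relies on the timestamps being ISO 8601 strings in the same format and time zone, so that string order matches chronological order.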

DynamoDB: Keys and what they mean

I'm confused as to how to use DynamoDB table keys. The documentation mentions HASH (which seem to also be referred to as Partition) keys and RANGE (or SORT?) keys. I'm trying to roughly align these with my previous understanding of database indexing theories.
My current, mostly guess-based understanding is that a HASH key is essentially a primary key - it must be unique and is automatically indexed for fast-reading - and a RANGE key is basically something you should apply to any other field you plan on querying on (either in a WHERE-like or sorting context).
This is then somewhat confused by the introductions of Local and Global Secondary Indexes. How do they play into things?
If anyone could nudge me in the right direction, bearing in mind my current, probably flawed understanding has come from the docs, I'd be super grateful.
Thanks!
Basically, the DynamoDB table is partitioned based on partition key (otherwise called hash key).
1) If the table has only a partition key, then it has to be unique. DynamoDB table performance depends largely on the partition key. A good partition key has well-distributed values (avoid using a sequence number as the partition key, as you might for an RDBMS primary key in legacy systems).
2) If the table has both a partition key and a sort key (otherwise called a RANGE key), then the combination of them needs to be unique. It is a kind of composite key in RDBMS terms.
However, the usage differs in DynamoDB table. DynamoDB doesn't have a sorting functionality (i.e. ORDER BY clause) across the partition keys. For example, if you have 10 items with same partition key value and different sort key values, then you can sort the result based on the sort key attribute. You can't apply sorting on any other attributes including partition key.
All sort key values of a partition key will be maintained in the same partition for better performance (i.e. physically co-located).
LSI - A table can have up to five LSIs, and they must be defined when you create the table. An LSI is a kind of alternate sort key for the table.
GSI - In order to understand GSI, you need to understand the difference between SCAN and QUERY API in DynamoDB.
SCAN - is used when you don't know the partition key (i.e. full table scan to get the item)
QUERY - is used when you know the partition key (i.e. sort key is optional)
As DynamoDB costing is based on read/write capacity units and for better performance, scan is not the best option for most of the use cases. So, there is an option to create the GSI with alternate partition keys based on the Query Access Pattern (QAP).
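The cost difference between the two can be sketched with a toy in-memory store. This is an illustration only, not the DynamoDB API: `ToyStore`, its `pk`/`sk` attributes, and the "items examined" counter are made up to stand in for consumed read capacity.

```python
from collections import defaultdict


class ToyStore:
    """Items grouped by partition key and sorted by sort key within each group."""

    def __init__(self, items):
        self.partitions = defaultdict(list)
        for item in items:
            self.partitions[item["pk"]].append(item)
        for part in self.partitions.values():
            part.sort(key=lambda it: it["sk"])

    def query(self, pk):
        """Read only the partition for pk. Returns (items, items_examined)."""
        part = self.partitions.get(pk, [])
        return list(part), len(part)

    def scan(self, predicate):
        """Examine every item in the store. Returns (matches, items_examined)."""
        examined = 0
        matches = []
        for part in self.partitions.values():
            for item in part:
                examined += 1
                if predicate(item):
                    matches.append(item)
        return matches, examined


# 100 partitions with 10 items each.
store = ToyStore([{"pk": f"user{u}", "sk": n} for u in range(100) for n in range(10)])
by_key, query_cost = store.query("user7")
by_filter, scan_cost = store.scan(lambda it: it["pk"] == "user7")
```

Both calls return the same ten items, but the scan examines all 1,000 items to find them. A GSI exists so that an access pattern that would otherwise need a scan can be served by a query on an alternate partition key.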
GSI Example

Performance issue with primary key

I am populating a medium-sized table (60GB, 500 million rows). The process completes reasonably fast if the table has no primary key (~1 hour using bulk insert), but it takes ~10 times longer if I create that table with the primary key. I assume this is because it takes time to verify the uniqueness constraint and also update the index at each insert.
I thought a good workaround would be to add the primary key later, since indexation on the table that's already populated should be much faster compared to incremental indexation. But sqlite doesn't seem to have the option to add primary key after the table is created (not sure why?).
I guess I could just not use a primary key at all, and instead just add a unique index after the table is populated. Is there any disadvantage to that?
Or any better solution recommended?
From a purely technical point of view, a unique index has exactly the same effect as a primary key. (In SQLite, some primary keys allow NULLs for backwards compatibility.)
The only difference is that, with a unique index, the key constraint does not show up in the table definition itself, which might be a bad thing for documentation purposes.
Also see Is CREATE UNIQUE INDEX or INTEGER PRIMARY KEY more performant in SQLite.
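A minimal sketch of the load-first, index-later approach using Python's built-in `sqlite3` module. The table name `readings`, its columns, and the row count are made up for illustration, and an in-memory database stands in for the real file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY here, so inserts do not have to maintain an index.
conn.execute("CREATE TABLE readings (id INTEGER, value TEXT)")

rows = [(i, f"v{i}") for i in range(10_000)]
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
conn.commit()

# Build the unique index once, after the table is populated.
conn.execute("CREATE UNIQUE INDEX readings_id ON readings (id)")

# From now on the index enforces uniqueness just like a primary key would.
try:
    conn.execute("INSERT INTO readings VALUES (?, ?)", (42, "duplicate"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

Note that `CREATE UNIQUE INDEX` itself fails if the loaded data already contains duplicate values, so duplicates are detected at index creation time rather than at insert time.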
Run the bulk insert inside a transaction and you'll avoid quite a few things that slow inserts down.
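As a rough sketch of what that looks like with Python's built-in `sqlite3` module: the connection is opened in autocommit mode so the transaction boundaries are explicit. The table `t`, the helper `bulk_insert`, and the row count are hypothetical.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the BEGIN/COMMIT below are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (id INTEGER, value TEXT)")


def bulk_insert(conn, rows):
    """Insert all rows inside a single transaction; roll back on any error."""
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


bulk_insert(conn, ((i, str(i)) for i in range(50_000)))
inserted = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Without the explicit transaction, each insert in autocommit mode would be committed on its own, which is typically much slower for a file-backed database.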
I just found this which is a great write up on how to speed things up in sqlite3.
Improve INSERT-per-second performance of SQLite?
