I am dealing with a roster of 15,000 unique employees. Depending on their 'Designation', they either impact performance or do not. The issue is that these employees could change their designation on any day. The roster is as simple as this:
AgentID
AgentDesignation
Date
I feel like I would be violating some normalization rules if I just have duplicate values (an agent having the same designation as the previous day, for example). Would I really want to create a new row for each date even if the Designation is the same? I want to always be able to get an agent's correct designation on a particular date.
All calculations are done in Excel, probably with VLOOKUP. Anyone have some tips?
The table structure you propose would not be a violation of normalization -- it contains a PRIMARY KEY (AgentID, Date) and a single attribute that is dependent on all elements of the key (AgentDesignation). Furthermore, it's easy (using the PRIMARY KEY constraint) to ensure that there is one-and-only-one designation per agent per day. The fact that many PRIMARY KEY values will yield the same dependent value does not mean the database is not correctly normalized.
An alternative approach using date ranges would likely result in fewer rows but guaranteeing integrity would be harder and searches for a particular value would be costlier.
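To make the lookup concrete, here is a minimal sketch of the proposed one-row-per-agent-per-day table in SQLite via Python; the table and column names are illustrative rather than taken from your roster:

```python
import sqlite3

# Minimal sketch of the one-row-per-agent-per-day design.
# Table and column names are illustrative, not from the original post.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AgentRoster (
        AgentID          INTEGER NOT NULL,
        RosterDate       TEXT    NOT NULL,   -- ISO-8601 date, e.g. '2024-03-01'
        AgentDesignation TEXT    NOT NULL,
        PRIMARY KEY (AgentID, RosterDate)    -- one designation per agent per day
    )
""")

conn.executemany(
    "INSERT INTO AgentRoster VALUES (?, ?, ?)",
    [
        (101, "2024-03-01", "Analyst"),
        (101, "2024-03-02", "Analyst"),      # repeating the designation is fine
        (101, "2024-03-03", "Team Lead"),
    ],
)

# The designation of an agent on a particular date is a simple keyed lookup.
row = conn.execute(
    "SELECT AgentDesignation FROM AgentRoster WHERE AgentID = ? AND RosterDate = ?",
    (101, "2024-03-02"),
).fetchone()
print(row[0])  # Analyst
```

The date-range alternative would replace RosterDate with start/end columns and a BETWEEN lookup, which is where the extra integrity and search cost comes in.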
I have a simple table which contains one unique partition key id and a bunch of other attributes including a date attribute.
I now want to get all records in a specific time range; however, as far as I understand, the only way to do this is to use a scan.
I tried to use a GSI on the date, but then I cannot use BETWEEN in the KeyConditionExpression.
Is there any other option?
Q: Are you providing one-and-only-one Partition Key value?
A: If YES, then you can query. If NO, it's a scan.
You are currently in scan territory, because you need to search over multiple ids.
To get to the promised land of queries, consider DynamoDB's design pattern for time series data. One implementation would be to add a GSI with a compound Primary Key representing the date. Split the date between a PK and SK. Your PK could be YYYY-MM, for instance, depending on your query patterns. The SK would get the leftover bits of the date (e.g. DD). Covering a date range would mean executing one or several queries on the GSI.
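As a rough illustration rather than a definitive implementation, a query against such a GSI might look like this with boto3; the table name, index name, and attribute names below are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical GSI "GSI_Date": partition key = month ("YYYY-MM"), sort key = day ("DD").
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Records")

def records_for_range(year_month: str, first_day: str, last_day: str):
    """All records in one month partition whose day falls in [first_day, last_day]."""
    response = table.query(
        IndexName="GSI_Date",
        KeyConditionExpression=(
            Key("yearMonth").eq(year_month) & Key("day").between(first_day, last_day)
        ),
    )
    # Pagination via LastEvaluatedKey omitted for brevity.
    return response["Items"]

# A range spanning several months means several queries, one per month partition.
items = records_for_range("2023-07", "05", "19")
```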
This pattern has many variants. If scale is a challenge and you are mostly querying a known subset of recent days, for instance, you could consider replicating records to a separate reporting table configured with the above keys and a TTL field to expire old records. As always, the set of "good" DynamoDB solutions is determined by your query patterns and scale.
Let's say I make a GSI for 'Name' and I have two people in my database who just happen to have the same name:
Tim Cook
Tim Cook
Now this will fail a uniqueness constraint on insert because of the duplicate values, hence we need another approach.
I was thinking about appending a salt to the end of the name values so that the BEGINS_WITH operator can still be used to search / match on, but that puts you in a weird position. What do you salt with? How many characters? The longer the salt, the more memory and potentially compute you waste cleaning up the salt before returning the results to the user. The shorter the salt, the more likely you are to have collisions. After all, there are some incredibly common names out there.
Here's an example of the values salted:
Tim Cook#ABCDEF
Tim Cook#ZYXWVU
This is great, as I can insert both values now, and I can create a 'search user by name' endpoint via the BEGINS_WITH('Tim Cook') operation, but it feels weird.
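For what it's worth, here is roughly what I imagine the salted write and the BEGINS_WITH lookup would look like with boto3. Every table, index, and attribute name here is made up, and the GSI partition key is just some bucketing attribute, since BEGINS_WITH only applies to a sort key:

```python
import uuid
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")  # hypothetical table

def put_user(user_id: str, name: str):
    # Append a short random suffix so two "Tim Cook"s get distinct sort key values.
    salted = f"{name}#{uuid.uuid4().hex[:6].upper()}"
    table.put_item(Item={
        "userId": user_id,
        "nameBucket": name[0].upper(),  # GSI partition key: some bucketing attribute
        "nameSalted": salted,           # GSI sort key: "<name>#<salt>"
    })

def search_users_by_name(name: str):
    response = table.query(
        IndexName="GSI_Name",  # hypothetical GSI: nameBucket / nameSalted
        KeyConditionExpression=(
            Key("nameBucket").eq(name[0].upper()) & Key("nameSalted").begins_with(name)
        ),
    )
    # Clean the salt back off before handing results to the caller.
    return [item["nameSalted"].split("#")[0] for item in response["Items"]]
```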
I did a bit of searching on sorting and searching by name in DynamoDB, though, and didn't come up with anything meaningful on how to proceed from here. Wondering what you all think.
My one final issue is that names are not evenly spread out, so you're inevitably going to have hotter partitions, but I just don't see another way around this, short of exfiltrating the data to another data store, such as a full-text search store, and querying it there.
You can’t insert to a GSI. So your concern is kind of misplaced.
You also can’t Get Item on a GSI, only Query, and that’s because there’s not necessarily one matching value for a given key.
Note: The GSI always projects the primary key over from the base table.
You can use the following schema pattern to achieve your goal:
Partition key: Name
Sort/Range key: createdAt (The creation time of that row)
In this case, if the name is the same for more than one person, all the matching items will be returned, sorted automatically. This schema will also give you a unique access pattern for each item of your table.
Partition key -> Sort key
Name -> createdAt
Tim Cook -> "HH:mm:ss"
Each row will have a different creation time and will provide unique composite key values for each item of the table.
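A minimal boto3 sketch of this pattern, assuming a table named People with Name as the partition key and createdAt as the sort key (all names are illustrative):

```python
from datetime import datetime, timezone
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
people = dynamodb.Table("People")  # hypothetical table: PK = Name, SK = createdAt

def add_person(name: str, **attributes):
    people.put_item(Item={
        "Name": name,
        "createdAt": datetime.now(timezone.utc).isoformat(),  # keeps duplicate names apart
        **attributes,
    })

def find_by_name(name: str):
    # Every item sharing the same Name comes back, already sorted by createdAt.
    return people.query(KeyConditionExpression=Key("Name").eq(name))["Items"]

add_person("Tim Cook", company="Apple")
add_person("Tim Cook", company="Somewhere else")
print(find_by_name("Tim Cook"))
```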
For some reason I thought GSIs had the same uniqueness constraint as partition keys; however, that's not the case - you can have duplicates.
In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.
Source
So a GSI is a perfectly good way to store duplicated information. Not sure this question is helpful now, since it came about through ignorance, so it might be worth deleting.
This seems like such an elementary part of databases; I cannot believe Dynamo does not do this.
Supposing I have a Case. I have 2 dates: when the Case became active, and when it became inactive. I want to write a query that would return the count of active cases for a given Date.
In SQL (and MySQL has special date indices), I could write an expression like 'where :date between active and inactive.' I cannot do this in DynamoDB for a bunch of reasons:
there is no date type
there only seem to be concatenated keys, since everything is a hash, hence no BETWEEN
So far the only things I have come up with are:
Sharding - I should probably shard this table. I did some reading on that, and the way Dynamo does sharding seems simple, although it kind of sucks that you end up with 2 tables
if I do this, then I can just ask for the active count each day and store it
which means that if I wanted the count for a day in the past, I would have to table scan, and worse, scan 2 tables (as I understand it)
Date Partitions - the problem here is which date to partition on; I guess activation, and then the presumption is that a count for a given date would use a key expression of active <= :date and a filter expression of inactive is null (a rough sketch of this is below)
Distinct Events - if I am recording Events on each case, the count of active cases on a given date is also the count of the distinct set of CaseIDs in the Events table for that date, but that looks like it's not easy to do
Still reading, so I would not be surprised if I am missing something obvious. Actually, one other possible way to do this is to move the event data to Timestream and then have it compute this aggregate.
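To make the 'Date Partitions' option above more concrete, here is a rough sketch of what I think it could look like with boto3. It assumes a GSI with a shard attribute as the partition key and the activation date as the sort key; all table, index, and attribute names are invented:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

# Hypothetical GSI "GSI_ActiveDate": partition key = shard, sort key = activeDate.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Cases")

def count_active_on(day: str, shards=range(4)):
    """Count cases active on `day`: activated on or before it and not yet inactive."""
    total = 0
    for shard in shards:  # one query per write shard
        kwargs = dict(
            IndexName="GSI_ActiveDate",
            KeyConditionExpression=Key("shard").eq(shard) & Key("activeDate").lte(day),
            FilterExpression=Attr("inactiveDate").not_exists() | Attr("inactiveDate").gt(day),
            Select="COUNT",
        )
        while True:  # page through results, counting as we go
            resp = table.query(**kwargs)
            total += resp["Count"]
            if "LastEvaluatedKey" not in resp:
                break
            kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return total
```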
I have a table structure consisting of cities and comments. I need to get all comments related to a city. I have made the primary key for comments the name of the city. Now when I query my table I can get all the comments related to the city, but I need them in order of the votes for each comment. The vote values are constantly changing. I have considered adding an order-by to my query, or adding the vote count as a range key and deleting and re-adding the record every time the votes change. These solutions don't seem that efficient, and I was wondering if there is a better way of doing it?
One easy thing you could do is to use a local secondary index - this DynamoDB feature can create a second table whose hash key is the same (the city name), but the sort key is the number of votes - which remains just an ordinary attribute in your original table. DynamoDB will automatically - and consistently - take care of the second table for you as you modify the first one.
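As a sketch (the table name Comments and index name VotesIndex are assumptions), reading the comments back in vote order then looks like this with boto3:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
comments = dynamodb.Table("Comments")  # hypothetical table

def top_comments(city: str, limit: int = 20):
    response = comments.query(
        IndexName="VotesIndex",                      # LSI: hash key = city, range key = votes
        KeyConditionExpression=Key("city").eq(city),
        ScanIndexForward=False,                      # highest vote count first
        Limit=limit,
    )
    return response["Items"]
```

Updating a comment's vote count stays an ordinary update on the base item; the index entry is maintained for you.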
Using an LSI is easier than coding the extra deletion and addition, and more efficient in the sense of less network activity and client work - but it may not be significantly cheaper on your Amazon bill, because DynamoDB charges you extra for that LSI work.
I am new to OLAP. If I have two fact tables, can they share the same dimension table?
A good example would be: if I have tables fact1 and fact2, can they both have a foreign key into a single date dimension (dimDate) table? Or do I need to (or should I) create a separate dimDate dimension table for each separate fact?
To me, and based on my research, I don't see any downside to sharing a dim table, but I wanted to check.
Thanks!
They can, and should.
That's the whole point of conformed dimensions: keeping the attributes in a single place, so as to avoid multiple versions of the truth coming from different fact tables.
So a single date dimension, with all the necessary attributes for each fact table, which is then linked from each fact table that needs it.
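As a toy illustration (table and column names invented, SQLite via Python standing in for whatever warehouse you use), both fact tables simply carry a foreign key to the one dimDate:

```python
import sqlite3

# Two fact tables sharing a single conformed date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimDate (
        dateKey       INTEGER PRIMARY KEY,   -- e.g. 20240301
        fullDate      TEXT,
        monthName     TEXT,
        fiscalQuarter TEXT
    );

    CREATE TABLE factSales (
        salesId  INTEGER PRIMARY KEY,
        dateKey  INTEGER REFERENCES dimDate(dateKey),
        amount   REAL
    );

    CREATE TABLE factMarketing (
        campaignEventId INTEGER PRIMARY KEY,
        dateKey         INTEGER REFERENCES dimDate(dateKey),  -- same dimension, same keys
        impressions     INTEGER
    );
""")

# Both facts join to the one dimDate, so date attributes have a single definition.
rows = conn.execute("""
    SELECT d.monthName, SUM(s.amount)
    FROM factSales s JOIN dimDate d ON d.dateKey = s.dateKey
    GROUP BY d.monthName
""").fetchall()
```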
Same for a customer dimension. If you have a sales fact table that needs customer info such as billing address, and a marketing fact table that holds info about campaigns each customer can benefit from, you would combine all those attributes in a single customer dimension. Some customers may not be referenced in the marketing fact table, others may not appear in the sales fact table, but all would exist in the single customer dimension, which is your single source of truth about who your customers are.