Yes/No attributes in logical database design (ERD)

I have an assignment where I have to design a logical model using SQL Developer.
I am converting a conceptual model to a logical model, and I have a relation NURSE that has "nurse_id" and "certification". The certification attribute has yes/no values.
My question is:
Should I move the yes/no attribute to a new relation, or is it okay to keep it in the NURSE relation? What is the best practice?
And is CHAR a suitable data type for that attribute?
Thank you,

Keep it in the NURSE relation, as it makes it easy to query how many nurses have certification, how many don't, and which nurses are certified.
You can use CHAR(1) and store Y or N. You can use a BIT datatype if your database supports it, or a BOOLEAN if your database supports it. Since all major databases have CHAR(1), I'd just keep it CHAR(1).
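If you go with CHAR(1), a CHECK constraint keeps stray values out. Here's a minimal sketch, using SQLite through Python purely for illustration (the table and column names are taken from the question; adapt the DDL to your actual database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The CHECK constraint restricts the flag to the two legal values.
conn.execute("""
    CREATE TABLE nurse (
        nurse_id      INTEGER PRIMARY KEY,
        certification CHAR(1) NOT NULL DEFAULT 'N'
                      CHECK (certification IN ('Y', 'N'))
    )
""")
conn.execute("INSERT INTO nurse VALUES (1, 'Y')")
conn.execute("INSERT INTO nurse VALUES (2, 'N')")

# Counting certified vs. uncertified nurses is a single GROUP BY.
for flag, count in conn.execute(
    "SELECT certification, COUNT(*) FROM nurse GROUP BY certification"
):
    print(flag, count)
```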

Because the certification attribute only has yes/no values, I would recommend keeping it in the same table as a one-to-one mapping. If one nurse could hold several certifications, then another table would be useful as a one-to-many mapping.
As for the data type, CHAR is fine. If you want to save space you can also use BOOLEAN, then render it as yes/no in the application.
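For completeness, here is a hedged sketch of the separate-table variant for the multi-certification case (again SQLite via Python; the nurse_certification table and its columns are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nurse (
        nurse_id INTEGER PRIMARY KEY
    );

    -- One row per certification a nurse holds; zero rows for a
    -- nurse simply means "not certified".
    CREATE TABLE nurse_certification (
        nurse_id  INTEGER NOT NULL REFERENCES nurse(nurse_id),
        cert_name TEXT    NOT NULL,
        PRIMARY KEY (nurse_id, cert_name)
    );
""")
conn.execute("INSERT INTO nurse VALUES (1)")
conn.execute("INSERT INTO nurse_certification VALUES (1, 'ICU')")
conn.execute("INSERT INTO nurse_certification VALUES (1, 'Pediatrics')")
```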


AWS DynamoDB Naming Convention

I am trying to create a naming convention for different objects in DynamoDB, such as tables, partition and sort keys, LSIs, GSIs, attributes, etc. I have read a lot of articles and there is no common way to do it, but I want to learn from real-world examples to choose which approach will best fit our needs.
The infrastructure I am working on is based on microservices. Along with this, some of our development environments share the same AWS account. Based on this, I ended up with something like this:
Tables: [Environment].[Service Name].[Table Name].ddb-table
GSIs/LSIs: [Environment].[Service Name].[Table Name].[GSI/LSI Name].ddb-[gsi/lsi]
Partition Key: pk ??? (in my understanding, the keys should have abstract names, because the single table stores versatile data in the same key)
Sort Key: sk ??? (in my understanding, the keys should have abstract names, because the single table stores versatile data in the same key)
Attributes: meaningful but as short as possible as they are kept for every item in the table
Different elements are separated by dot (.)
All names are separated by dashes (kebab-case) and in lower case
Tables/GSIs/LSIs are in singular form
Here is an example:
Table: dev.user-service.user-order.ddb-table
LSI: dev.user-service.user-order.lsi1pk.ddb-lsi
GSI: dev.user-service.user-order.gsi1pk.ddb-gsi
What naming conventions do you follow?
Thanks a lot in advance!
My advice:
Use PK and SK as your partition key and sort key.
Don't put table names into code; fetch them from Parameter Store (see the sketch after this list). For example, if you ever do a table restore it will be to a new table name, and if you want to send traffic to the new name you won't want to change code.
Thus don't get too fixed to any particular table name. Never try to have code predict a table name. Only have them be consistent to help humans.
Don't put regions in your table names. When you switch to Global Tables they all keep the same name. Awkward!
GSIs can be called GSI1, GSI2, etc. GSI keys are GSI1PK and GSI1SK, etc.
Tag your tables with their name if you ever want to track per-table costs later.
Short yet meaningful attribute names are nice because they reduce storage and can reduce RCUs/WCUs if you're near the 4 KB read or 1 KB write boundaries.
Use different accounts for dev, staging, and production. If you want to put the environment names into table names as well, to help you spot "OMG, I'm in production", that's fine.
If you have lots of attributes forming the item payload which aren't used for GSIs or filtering and are always returned together, consider storing them as a single string or binary blob that gets parsed client-side. You can even compress it. It's more efficient and lower latency because it skips the data marshaling.
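A hedged sketch of the Parameter Store lookup and the generic GSI key names mentioned above (the parameter path, table contents, and index names are invented for illustration):

```python
import boto3
from boto3.dynamodb.conditions import Key

ssm = boto3.client("ssm")
dynamodb = boto3.resource("dynamodb")

# Code asks Parameter Store for the table name instead of
# hardcoding it, so a restore to a new table name only needs
# a parameter update, not a code change.
param = ssm.get_parameter(Name="/dev/user-service/user-order/table-name")
table = dynamodb.Table(param["Parameter"]["Value"])

# Abstract key names (PK/SK, GSI1PK/GSI1SK) keep a single-table
# design reusable across entity types.
resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("ORDER#123"),
)
print(resp["Items"])
```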

How to sort and query DynamoDB by non-unique values, e.g. names

Let's say I make a GSI for 'Name' and I have two people in my database who just happen to have the same name:
Tim Cook
Tim Cook
Now this will fail a uniqueness constraint on insert for duplicate values, hence we need another approach.
I was thinking about hashing the name values at the end so that the BEGINS_WITH operator can still be used to search / match on but that puts you in a weird position. What do you salt with? How many characters? The longer the salt the more memory and potentially compute you waste cleaning up the salt before returning the results to the user. The shorter the salt the more likely you are to have collisions. After all there are some incredibly common names out there.
Here's an example of the values salted:
Tim Cook#ABCDEF
Tim Cook#ZYXWVU
This is great, as I can now insert both values and create a 'search user by name' endpoint for the user via the BEGINS_WITH('Tim Cook') operation, but it feels weird.
I did a bit of searching though on sorting and searching by names in DynamoDB and didn't come up with anything meaningful on how to proceed from here. Wondering what you guys think.
My one and final issue is that names are not evenly spread out, so you're inevitably going to have hotter partitions, but I just don't see a way around this, short of exporting the data to another data store, such as a full-text search engine, and querying it there.
You can’t insert to a GSI. So your concern is kind of misplaced.
You also can’t Get Item on a GSI, only Query, and that’s because there’s not necessarily one matching value for a given key.
Note: The GSI always projects the primary key over from the base table.
You can follow the following schema pattern to achieve your goal:
Partition key: Name
Sort/Range key: createdAt (The creation time of that row)
In this case, if more than one person has the same name, you get back all of the matching items, sorted automatically by the sort key. This schema also gives you a unique access pattern for each item in your table.
Partition key -> Sort key
Name -> createdAt
Tim Cook -> "HH:mm:ss"
Each row will have a different creation time, which provides unique composite key values for each item in the table.
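A hedged sketch of that pattern with boto3 (the table name is invented; storing createdAt as an ISO-8601 string makes it sort correctly as a range key):

```python
import boto3
from datetime import datetime, timezone
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")  # hypothetical table

def put_user(name: str) -> None:
    table.put_item(Item={
        "name": name,                                         # partition key
        "createdAt": datetime.now(timezone.utc).isoformat(),  # sort key
    })

def users_named(name: str) -> list:
    # Every item sharing the partition key comes back, ordered by
    # createdAt; duplicate names are fine.
    return table.query(KeyConditionExpression=Key("name").eq(name))["Items"]

put_user("Tim Cook")
put_user("Tim Cook")
print(users_named("Tim Cook"))
```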
For some reason I thought GSIs had the same uniqueness constraint as the base table's primary key; however, that's not the case: you can have duplicates.
In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.
Source
So a GSI is a perfectly good way to store duplicated information. I'm not sure this question is helpful now, since it came about through my own misunderstanding, so it might be worth deleting.

How to create a surrogate key in Big Data

We are planning to move our transactional data into a Big Data platform and do the analysis there. One challenge we face is how to create auto-increment values in Big Data; we need them to generate surrogate keys.
The most common approach is to use a version 4 UUID, i.e. a pseudo-random identifier with an extremely low collision chance.
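A minimal example (Python's uuid4 draws 122 random bits, so the collision chance is negligible):

```python
import uuid

# No coordination needed: each node can mint its own keys
# independently.
surrogate_key = str(uuid.uuid4())
print(surrogate_key)
```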
If you really need sequential (or at least monotonic) identifiers for some reason, then you will need to generate them from a single source, and this single source may need to be separated out as a service, e.g. Twitter Snowflake.
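A minimal sketch of a Snowflake-style generator, assuming the commonly cited bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit per-millisecond sequence); this illustrates the idea and is not Twitter's actual implementation:

```python
import threading
import time

class SnowflakeLike:
    """IDs are (timestamp << 22) | (worker_id << 12) | sequence."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024          # 10 bits
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12 bits
                if self.sequence == 0:          # exhausted this millisecond:
                    while now <= self.last_ms:  # spin until the next one
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeLike(worker_id=1)
print(gen.next_id(), gen.next_id())  # monotonically increasing
```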
Yes, I agree with the UUID approach,
but please make sure that you refactor your ER model to strike a proper balance between normalised and denormalised entities.
If you move your existing application's ER model as-is into a Big Data architecture, it will slow down performance, as it might have to do joins across big tables.
Also make sure that the key you use to access data is stable and doesn't change when data is updated, since you are storing it in a NoSQL database.
These links will give you some idea about the above:
Transition-RDBMS-NoSQL
relational-databases-vs-non-relational-databases

Using auto-number database fields theory

I was on "another" programming forum, and we were talking about getting the next number from an auto-increment field BEFORE an insert takes place (there is a way using ADOX). This was in an MS-Access database btw.
Anyway, the discussion veered off into the area of SHOULD you use auto-increment fields for things like invoice numbers, PO numbers, bill of lading numbers, or anything else that needs a unique, incrementing number.
My thoughts were "why not"? Other people are arguing that an Invoice number (for instance) should be managed as a separate table and incremented with code, not using an auto-number field.
Can someone give me a good reason why that would be true?
I've used auto-number fields for years for just this type of thing and have never had problem one.
Your thoughts?
I have always avoided auto_increment, and as it turns out, for good reason. Originally, though, my reason was simply that it was what the professor told us.
Facebook had a major breach a few years ago, simply because they were using AUTO_INCREMENT fields for user IDs. It doesn't take a calculator to figure out that if my ID is 10320, there is likely someone with ID 10319, etc.
When debugging (or proofing a design), having a key that reflects the data it represents is a heck of a lot easier.
Keys that reflect the data also reduce the potential for corrupted data (typos and user guessing).
Such keys require the developer to think about their data. I have never come across a table using them that was not normalized.
Other than the fact that deadlines often run tight, there is no great reason for auto-increment.
Normally I use an autonumbering field for the ID so I don't need to think about how it's generated.
Recordset operations like insert and delete alter the sequence, skipping blocks of numbers.
When you manage customer IDs, invoice numbers and so on, it's better to have full control over them instead of leaving them under the system's control.
You can create a function that generates the desired numbers for you using a rule (e.g. the invoice number can be a function that includes the invoicing date).
With autonumbering you can't manage this.
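A hedged sketch of such a rule (the INV-YYYYMMDD-sequence format is invented for illustration; a real implementation would keep the counter in a table and increment it in the same transaction as the insert):

```python
from collections import defaultdict
from datetime import date

# In-memory stand-in for a per-day counter table.
_counters: defaultdict[str, int] = defaultdict(int)

def next_invoice_number(on: date | None = None) -> str:
    day = (on or date.today()).strftime("%Y%m%d")
    _counters[day] += 1
    return f"INV-{day}-{_counters[day]:04d}"   # e.g. INV-20240131-0001

print(next_invoice_number())
print(next_invoice_number())
```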
Beyond that, there are NO FIXED RULES about what to do and what not to do.
It's just your practice and experience, and the degree of freedom you want to have.
Bye :-)

Matching unique ids in two different databases

I have two different databases that are not connected in any way. In fact, one is a public school database and one is a HUD (housing) database. By law they are not allowed to share names and other specific identifiers. Birthdates and addresses are okay, along with zip codes and other more general IDs. The users need to be able to query the other database to get non-specific information, so it would appear that they need to share the same unique ID. I was considering such things as using birthdates and perhaps initials of the name, or perhaps the last 4 digits of the SSN along with the birthdate. The client was thinking of global positioning data, but I'm concerned about apartments next to one another and about families moving. Any ideas?
First you need to determine what your measure of uniqueness will be. If either database has two people with the same value for that measure, you need to change your strategy. After that, put a constraint on both databases so that these properties (birthdate, SSN) are what make a person record unique.
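A hedged sketch of one way to turn those properties into a shared, non-identifying match key (the field choice and salt handling are assumptions for illustration, not a vetted de-identification scheme):

```python
import hashlib

def match_key(birthdate: str, ssn_last4: str, salt: str) -> str:
    # Both agencies run the same function with the same agreed
    # salt, so equal inputs yield equal keys without either side
    # exchanging names.
    material = f"{salt}|{birthdate}|{ssn_last4}".encode()
    return hashlib.sha256(material).hexdigest()

# The same person produces the same key in both databases:
print(match_key("1990-04-12", "1234", salt="agreed-secret"))
```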
