I have two different databases that are not connected in any way. In fact, one is a public school database and one is a HUD (housing) database. By law they are not allowed to share names and other specific identifying information. Birthdates and addresses are okay, along with zip codes and other more general identifiers. The users need to be able to query the other database to get non-specific information, so it would appear that the two systems need to share the same unique ID. I was considering things like using the birthdate plus the initials of the name, or perhaps the last 4 digits of the SSN along with the birthdate. The client was thinking of global positioning data, but I'm concerned about apartments next to one another and about families moving. Any ideas?
First you need to determine what your measure of uniqueness will be. If two people in either database can end up with the same value for that measure, you need to change your strategy. After that, put a constraint on both databases specifying that these properties (birthdate, SSN) are what make a Person record unique.
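As a rough illustration of that idea, here is a minimal sketch of deriving the same opaque ID in both systems from the birthdate plus the last four SSN digits. The field formats and the shared salt are assumptions for the example, not part of either schema; a keyed hash keeps the ID from being trivially reversed by anyone who only knows birthdates and SSN fragments.

```python
import hashlib
import hmac

# Hypothetical shared secret known to both agencies; without it, the derived
# ID cannot be recomputed from public facts alone.
SHARED_SALT = b"replace-with-a-secret-shared-by-both-agencies"

def shared_person_id(birthdate_iso: str, ssn_last4: str) -> str:
    """Derive the same opaque identifier in both databases.

    birthdate_iso: e.g. "1980-07-04"; ssn_last4: e.g. "1234".
    Collisions between unrelated people are still possible, so treat
    this as a join key to verify, not as proof of identity.
    """
    message = f"{birthdate_iso}|{ssn_last4}".encode()
    return hmac.new(SHARED_SALT, message, hashlib.sha256).hexdigest()

print(shared_person_id("1980-07-04", "1234"))
```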
I am trying to create a naming convention for different objects in DynamoDB, such as tables, partition and sort keys, LSIs, GSIs, attributes, etc. I have read a lot of articles and there is no common way to do it, but I want to learn from real-world examples to choose the one that best fits our needs.
The infrastructure I am working on is based on microservices. In addition, some of our development environments share the same AWS account. Given that, I ended up with something like this:
Tables: [Environment].[Service Name].[Table Name].ddb-table
GSIs/LSIs: [Environment].[Service Name].[Table Name].[GSI/LSI Name].ddb-[gsi/lsi]
Partition Key: pk ??? (in my understanding, the keys should have abstract names, because a single table stores heterogeneous data under the same key)
Sort Key: sk ??? (same reasoning as for the partition key)
Attributes: meaningful but as short as possible as they are kept for every item in the table
Different elements are separated by a dot (.)
All names are dash-separated (kebab-case) and lower case
Tables/GSIs/LSIs are in singular form
Here is an example:
Table: dev.user-service.user-order.ddb-table
LSI: dev.user-service.user-order.lsi1pk.ddb-lsi
GSI: dev.user-service.user-order.gsi1pk.ddb-gsi
What naming conventions do you follow?
Thanks a lot in advance!
My advice:
Use PK and SK as your partition key and sort key.
Don't put table names into code; look them up in Parameter Store (see the sketch after this list). For example, if you ever do a table restore, it will be to a new table name, and when you want to send traffic to the new table you won't want to change code.
Thus don't get too attached to any particular table name, and never have code predict a table name. Names only need to be consistent enough to help humans.
Don't put regions in your table names. When you switch to Global Tables they all keep the same name. Awkward!
GSIs can be called GSI1, GSI2, etc. GSI keys are GSI1PK and GSI1SK, etc.
Tag your tables with their name if you ever want to track per-table costs later.
Short yet meaningful attribute names are nice: they reduce storage and can reduce RCU/WCU consumption if your items are near the 4 KB read or 1 KB write boundaries.
Use different accounts for dev, staging, and production. If you also want to put the environment name into table names to help you spot "OMG, I'm in production", that's fine.
If an item carries a large payload of attributes that aren't used for GSIs or filtering and are always returned together, consider storing them as a single string or binary attribute that gets parsed client-side. You can even compress it. It's more efficient and lower latency because it skips the data marshaling.
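Here is a minimal sketch of the Parameter Store lookup mentioned above, assuming boto3; the parameter path and key attribute names are hypothetical. Restoring to a new table then only means updating the parameter value, not redeploying code.

```python
import boto3

ssm = boto3.client("ssm")
dynamodb = boto3.resource("dynamodb")

def get_table(parameter_name: str = "/dev/user-service/user-order/table-name"):
    # Resolve the table name at runtime instead of hard-coding it.
    response = ssm.get_parameter(Name=parameter_name)
    return dynamodb.Table(response["Parameter"]["Value"])

table = get_table()
# PK/SK are the generic key names recommended above; the values are examples.
item = table.get_item(Key={"PK": "USER#123", "SK": "ORDER#456"}).get("Item")
```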
While working with the DICOM study, series, and media concepts, I wondered whether these values have to be unique across all data, or only within the patient they belong to.
Phrased otherwise: can two patients have a study/series/SOP instance UID with the same value?
Or does the DICOM standard simply not care about that, leaving it open to the implementor to decide?
In DICOM, a Study (identified by its Study Instance UID) is always associated with a single Patient. See DICOM standard part 3 for details.
To answer your initial question/thought: a Unique Identifier (UID) has to be globally unique, i.e. worldwide, across all patients, devices, hospitals, etc.
A UID in DICOM (no matter which UID) is always globally unique. So, to answer your question, uniqueness is not limited to the patient level or anything like that.
The following is from the specification:
2017a Part 5 - Data Structures and Encoding (9 Unique Identifiers (UIDs))
Unique Identifiers (UIDs) provide the capability to uniquely identify a wide variety of items. They guarantee uniqueness across multiple countries, sites, vendors and equipment. Different classes of objects, instance of objects and information entities can be distinguished from one another across the DICOM universe of discourse irrespective of any semantic context.
More details about DICOM UID can be found in this answer.
Your comment on the question is quoted below:
My question was more about what to do in case I choose to clone a patient in my system and attach the same dicom(s) to it. Should I regenerate the dicom-uid's or could I keep them as-is.
I am not sure what you mean by "clone". While cloning, if there is any change in the dataset, you should regenerate the SOP Instance UID. Even if you simply apply a lossy transfer syntax to your dataset, you should regenerate the SOP Instance UID. Any action that differentiates the dataset from the original requires a new SOP Instance UID. So, while cloning, if you are changing patient demographics, a new UID should be generated. Whether a new Study Instance UID should be generated depends on what was changed.
OTOH, if you are just copying your dataset to a different location, it is still the same dataset. You do not need to regenerate UIDs in this case.
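As a concrete illustration, here is a minimal sketch using pydicom (an assumption; the question doesn't name a toolkit) of cloning a dataset onto a new patient and regenerating the instance UIDs. File names and demographic values are placeholders.

```python
from pydicom import dcmread
from pydicom.uid import generate_uid

ds = dcmread("original.dcm")  # placeholder file name

# The clone represents a different patient, i.e. a change in demographics.
ds.PatientName = "CLONE^PATIENT"
ds.PatientID = "CLONE-0001"

# A cloned object is a new real-world object, so it gets fresh UIDs.
ds.StudyInstanceUID = generate_uid()
ds.SeriesInstanceUID = generate_uid()
ds.SOPInstanceUID = generate_uid()
# Keep the file meta header consistent with the new SOP Instance UID.
ds.file_meta.MediaStorageSOPInstanceUID = ds.SOPInstanceUID

ds.save_as("clone.dcm")
```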
Unfortunately, although the standard states that UIDs should be globally unique, in my experience you cannot guarantee it at the series level. I have come across series with duplicate IDs across studies. To protect yourself, assume you have to use StudyUID + SeriesUID to ensure a unique series key.
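A small sketch of that defensive composite key, in Python for illustration: series are indexed by the (study, series) pair rather than by the series UID alone.

```python
# Index series by (StudyInstanceUID, SeriesInstanceUID) instead of trusting
# SeriesInstanceUID to be unique on its own.
series_index: dict[tuple[str, str], list[str]] = {}

def add_instance(study_uid: str, series_uid: str, sop_uid: str) -> None:
    series_index.setdefault((study_uid, series_uid), []).append(sop_uid)
```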
I'm building an app that tracks the user's location and updates Firebase. I've read the documentation about structuring data but still have a few questions.
I'm considering structuring the data in one of two ways, but can't decide between them.
users
  $id
    -position
    -other attr
vs:
user_position
  $id
users
  $id
    -other attr.
In what scenario would the first design work best, and in what scenario the second?
If you only keep one position per user (as seems to be the case, given that you use the singular user_position), there is no useful difference between the two structures. A user's position in that case is just another attribute, one that happens to have two values (lat and lon).
But if you want to keep multiple positions per user, then your first structure is mixing entity types: users and user_positions. This is an anti-pattern when it comes to Firebase Database.
The two most common reasons are:
Say you want to show a list of user names (or any specific, single-value attribute). With the first structure you will also need to read the list of all positions of all users, just to get the list of names. With the second structure, you just read the user's attributes. If that is still much more data than you need, consider also keeping a list of /user_names for optimal read performance.
Many developers end up wanting different access rules for the user positions and the other user attributes. In the first structure that is only possible by pushing the read permission from the top /users down to lower in the tree. In the second structure, you can just give separate permissions to /users and /user_positions.
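To make the first reason concrete, here is a minimal sketch using the Python Admin SDK (firebase-admin); the credentials path, database URL, user ID, and the name attribute are assumptions. With the split structure, listing names never touches the position data.

```python
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")  # placeholder path
firebase_admin.initialize_app(
    cred, {"databaseURL": "https://example-project.firebaseio.com"}  # placeholder URL
)

# Reads only /users; the potentially large /user_positions subtree is untouched.
users = db.reference("users").get() or {}
names = {uid: attrs.get("name") for uid, attrs in users.items()}

# A user's position is fetched on demand from its own top-level location.
position = db.reference("user_positions/some-user-id").get()  # placeholder id
```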
When using graph databases (Neo4j in my case), we can represent the same information in many ways: making each entity a node and connecting all entities through relationships, or just adding the entities to the attribute list of a node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the database from different nodes up to a depth of 4 and examining the information through connected nodes or attributes (depending on the approach).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computation?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, one without many dense nodes having over, say, 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index on these values, instead of having to resort to filtering on property values or indexing the property with an expensive index lookup.
The first one is much better, since you're querying on entities such as Stanford, and that entity is related to many person nodes. In my opinion, modeling as nodes is more intuitive and easier to query. "Find all persons who went to Stanford" would not be very easy in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if year is really important, drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
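For reference, here is a minimal sketch of the "friends of John who went to Stanford" query via the official Python driver. The labels and relationship types (Person, School, FRIEND, ATTENDED) are assumptions about the model, not something given in the question.

```python
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Entities as nodes: the query starts at John, walks FRIEND edges, and uses
# the Stanford node itself as the filter, with no property scan needed.
QUERY = """
MATCH (:Person {name: 'John'})-[:FRIEND]-(f:Person)-[:ATTENDED]->(:School {name: 'Stanford'})
RETURN f.name AS name
"""

with driver.session() as session:
    for record in session.run(QUERY):
        print(record["name"])

driver.close()
```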
I have company, customer, supplier, etc. tables, which all have address-related columns.
I am trying to figure out whether I should create a new 'addresses' table and move all the address columns into it.
Having address columns on all tables is easy to use and query, but I am not sure it is the right way of doing it from a design perspective; having the same columns repeated over a few tables makes me wonder.
The content of the addresses is not important to me; I will not be checking or using them in any decision-making processes. They are purely informational. Currently I am looking at 5 tables that have address information.
The answer to all design questions is this:
It depends.
So basically, in the address case it depends on whether or not you will have more than one address per customer. If you will, put the addresses in a new Addresses table and give each address a CustomerID (see the sketch below). It's overkill (most times; it depends!) to create a generic Address table and map it to the company/customer/supplier tables.
It's also often overkill (and dangerous) to map addresses in a many-to-many relationship between your objects (as addresses can seem to magically change on users if you do this).
The one big rule is: Keep it simple!
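Here is a minimal sketch of that one-to-many layout, using SQLite for illustration; the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

-- One-to-many: each address row belongs to exactly one customer, so a
-- customer can have any number of addresses without duplicated columns.
CREATE TABLE addresses (
    address_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    street      TEXT,
    city        TEXT,
    zip         TEXT
);
""")
```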
This is called database normalization. And yes, you want to split them up, if for no other reason than that doing it later, once you have code and queries in place, will be much harder.
As a rule, you should always design your database in Third Normal Form, even for simple apps. (There will be a few cases where you won't, for performance or logistical reasons, but starting out I would always aim for Third Normal Form, and then learn to cheat after you know the right way of doing things.)
EDIT: To expand on this and add some of the comments I have made on others' posts: I am a big believer in starting with a simple design when it comes to code, and refactoring when it becomes clear that it is getting too complex and more in-depth object-oriented principles would be appropriate. However, refactoring a database that is in production is not so simple. It is all about ROI. It is just too easy to design a normalized database from the outset to justify not doing it. The consequences of a poorly designed database can be catastrophic, and the realization usually comes too late.
Yes, you should separate the addresses into a table of their own. It's a smart question to ask. The key here is that the general format of an address is the same regardless of whose it is; a customer, a company, a supplier... they all have the same fields for addresses.
What makes this worthwhile is the ability to treat addresses as an atomic element; that is, you can generalize all the address-related functionality and have it deal with just one table, as opposed to several tables with the schema drift that can occur between them.
If you are using those addresses only within the scope of their own tables, there may be no real benefit to moving them to a table of their own.
Basically, it doesn't sound like it's worth the effort.
If there's an overlap between tables (i.e. the same organization is entered in both the company and supplier tables), and the address should always be the same in both, then it's probably worth moving address off into its own table and having foreign keys to it from your other tables. That way, you only have to update it in one spot when it changes.
If the three tables are entirely independent from each other, then there's not really much to gain from moving the data to another table, so you might as well leave it alone.
I think it entirely depends on the purpose of the database. Admittedly, all address information is structurally the same and, from a theoretical standpoint, should be in a single table linked to the parent tables by a key.
However, from a performance and query perspective, keeping them in their respective tables does simplify reporting.
I have a situation at my current company [logistics] where the addresses are actually logically the same: they're all locations, regardless of whether they're a pickup location, a delivery location, a customer, etc.
In my case, I'd say they should most definitely all be in one table. But looked at from a supplier/customer/contact-information standpoint, I'd say that while it's theoretically nice to have the addresses in one table, in practice it won't buy you a whole lot, as the data is unlikely to be repeated.
I disagree with Dave. The many-to-many approach (Address <-> User) is both safe and highly advantageous.
When a customer moves, the addresses in the Address table do NOT change. Instead, the new address is looked up in the Address table, and the customer etc. is linked to that record. If the new address isn't already in the table, it's added (see the sketch at the end of this answer).
So do address records themselves ever change? Yes, in cases like these:
it turns out that the address has a typo
the US Postal Service changes the street name
These are the very situations where putting all addresses in one table without repetition pays off; any other arrangement would require annoying and repetitive data entry.
Of course, if the database is abused, then it would be safer to avoid the many-to-many relationship. But by that token, if the database is in bad hands, it's better to just print everything out, store it in a file cabinet, and verify every transaction against the paper copy. So "protection against misuse" is not a good design principle, in my opinion.
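Here is a minimal sketch of that many-to-many layout, again using SQLite for illustration; the names and columns are made up. Moving a customer means re-pointing a link row, never editing a shared address in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each real-world address is stored exactly once.
CREATE TABLE addresses (
    address_id INTEGER PRIMARY KEY,
    street     TEXT,
    city       TEXT,
    zip        TEXT,
    UNIQUE (street, city, zip)
);

-- The link table carries the many-to-many relationship; a move re-points
-- customer_id at a different address_id instead of editing the address.
CREATE TABLE customer_addresses (
    customer_id INTEGER NOT NULL,
    address_id  INTEGER NOT NULL REFERENCES addresses(address_id),
    PRIMARY KEY (customer_id, address_id)
);
""")
```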