Node import and performance question - drupal

For one of my clients I have to import a CSV of Medicare plans provided by the government (part one provided here) into Drupal 7. There are about 500,000 rows of data in that CSV, most of which differ only by the FIPS County code field - basically, every county that a plan is available in counts as one row.
Should I import all 500k rows into Drupal 7 as individual nodes, or create a single node for every plan and put the numerous FIPS codes associated with that plan in a multi-value text field? I opted for the latter route to begin with; however, when I looked in the plan database, I found that some plans are available in more than 10,000 counties. I'd like to find the most efficient, Drupal-esque solution to storing all these plans and where they are available.

Generally it is best to avoid storing duplicate data, so you are right: creating 500k rows as individual nodes is a bad idea. I would instead create two content types (using CCK):
- Medicare Plan
- FIPS County code (or maybe just County)
And then create a many-to-many relationship between them (using CCK Node Reference, maybe Corresponding node references for mutual relationships if needed).
You can then create a view that will list all FIPS County codes attached to a particular Medicare Plan.
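To make the normalization concrete, here is a minimal relational sketch of the suggested layout, using plain SQLite rather than Drupal itself. The table and column names are hypothetical; the point is the two entity tables plus a junction table standing in for the node references:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per plan, one row per county, and a junction table for the
# many-to-many relationship -- mirroring the two content types plus
# node references suggested above. All names are illustrative.
cur.executescript("""
CREATE TABLE plan   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE county (fips TEXT PRIMARY KEY, name TEXT);
CREATE TABLE plan_county (
    plan_id INTEGER REFERENCES plan(id),
    fips    TEXT    REFERENCES county(fips),
    PRIMARY KEY (plan_id, fips)
);
""")

cur.execute("INSERT INTO plan VALUES (1, 'Medicare Plan A')")
cur.executemany("INSERT INTO county VALUES (?, ?)",
                [("01001", "Autauga"), ("01003", "Baldwin")])
cur.executemany("INSERT INTO plan_county VALUES (?, ?)",
                [(1, "01001"), (1, "01003")])

# The "view" of all FIPS codes attached to a particular plan:
rows = cur.execute("""
    SELECT c.fips, c.name
    FROM plan_county pc JOIN county c ON c.fips = pc.fips
    WHERE pc.plan_id = ?
    ORDER BY c.fips
""", (1,)).fetchall()
print(rows)
```

Each plan and each county is stored exactly once, so a plan available in 10,000 counties costs 10,000 small junction rows rather than 10,000 duplicated plan records.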

I ended up going with a row per plan - as it turned out, there were subtle differences between them that I missed. Thanks to all who answered!

Related

At what point do you need more than one table in DynamoDB?

I am working on an asset tracking system that also manages the concept of "projects". The users of this application perform maintenance activities on their customer's assets, so they need an action log where actions on an asset start life as a task in a project. For example, "Fix broken frame" might be a task where an action would have something like "Used parts a, b, and c to fix the frame" with a completed time and the employee who performed the action.
The conceptual data model for the application starts with a Customer that has multiple locations and each location has multiple assets. Each asset should have an associated action log so it is easy to view previous actions applied to that asset.
To me, that should all go in one table based upon the logical ownership of that data. Customer owns Locations which own Assets which own Actions.
I believe I should have a second table for projects, as this data is tangential to the Customer/Location/Asset data. However, because I read so much about how it should all be one table, I'm not sure whether this delineation only exists because I've modeled the data incorrectly, unable to get past the 3NF modeling that I've used for my entire career.
Single-table design doesn't forbid you from creating multiple tables. Instead it encourages you to use a single table per microservice (meaning: store correlated data, which you want to access together, in the same table).
Let's look at some anecdotes from experts:
Rick Houlihan tweeted over a year ago
Using a table per entity in DynamoDB is like deploying a new server for each table in RDBMS. Nobody does that. As soon as you segregate items across tables you can no longer group them on a GSI. Instead you must query each table to get related items. This is slow and expensive.
Alex DeBrie responded to a tweet last August
Think of it as one table per service, not across your whole architecture. Each service should own its own table, just like with other databases. The key around single table is more about not requiring a table per entity like in an RDBMS.
Based on this, you should ask yourself ...
How related is the data?
If you'd build using a relational database, would you store it in separate databases?
Are those actually 2 separate micro services, or is it part of the same micro service?
...
Based on the answers to those (and similar) questions, you can argue for either keeping it in one table or splitting it across two tables.
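If the Customer/Location/Asset/Action data does end up in one table, a common single-table pattern is to encode the hierarchy in a composite sort key so related items share a partition. The key shapes below are hypothetical (plain dicts, not real DynamoDB calls), just to illustrate the idea:

```python
# Single-table sketch: every item for a customer lives in one partition,
# with the hierarchy encoded in the sort key. Key formats are hypothetical.
items = [
    {"PK": "CUST#42", "SK": "CUST#42",                "type": "Customer"},
    {"PK": "CUST#42", "SK": "LOC#7",                  "type": "Location"},
    {"PK": "CUST#42", "SK": "LOC#7#ASSET#99",         "type": "Asset"},
    {"PK": "CUST#42", "SK": "LOC#7#ASSET#99#ACT#001", "type": "Action",
     "note": "Used parts a, b, and c to fix the frame"},
]

def query(pk, sk_prefix=""):
    """Mimic a DynamoDB Query: partition-key equality + begins_with on the sort key."""
    return [i for i in items
            if i["PK"] == pk and i["SK"].startswith(sk_prefix)]

# Everything for one customer comes back in a single query:
everything = query("CUST#42")

# The action log for one asset, via begins_with on the sort key:
actions = query("CUST#42", "LOC#7#ASSET#99#ACT#")
print(len(everything), [a["note"] for a in actions])
```

This is the grouping Houlihan's quote refers to: splitting these entities into separate tables would turn each of these single queries into several. Whether Projects belong in this table or their own is then exactly the service-boundary question above.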

How to model airport/flight data in a graph database like neo4j

I need to model airline flight data in a graph database (I am specifically working with neo4j, though I will consider others if that becomes problematic). My question is more about how to model this data in a way that will ease traversal and discovery of different flight options. A few specific examples of the type of data I would like to both store and later query:
1) A direct flight scenario like JFK->LAX. Seems straightforward, simple two node relationship. But there are many flights that may be of interest between these two nodes. So, if I need to store individual flight detail, is that best in an array on the relationship between the JFK and LAX nodes?
2) A flight scenario with multiple stops, like JFK->LAX->SAN. In this scenario, it seems like modeling the relationship between the three nodes may be of limited utility if I'm only interested in the departure and arrival city? i.e. I could have a relationship from JFK->SAN and the fact that there is a layover in LAX could be a property on that relationship?
If I need to query or traverse the graph based on arrays of data in relationships between nodes, and those arrays become large (e.g. 100 different flights between JFK and LAX), will that introduce performance or scalability problems?
Hopefully this question isn't too open-ended - I'm just trying to avoid building something that works for a small example model with ~5 nodes but can't scale to hundreds of airports and tens of thousands of flights.
Hundreds of airports and tens of thousands of flights is still a very small data set and I'd be surprised if that would be a problem in neo4j.
Off the top of my head you could perhaps have all the airports as their own nodes and each route could be its own node with relationships to all the airports it touches, maybe with an "order" property on each relationship which is local to the route.
            (ROUTE1)
           /    |    \
  order=1 /  order=2  \ order=3
         v      v      v
      (JFK)   (LAX)   (SAN)
I'm sure there are better solutions.
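To show how the route-as-node idea answers the "does this flight connect A to B" question, here is a tiny in-memory sketch (plain Python, not Neo4j calls; route and airport names are just examples). Each route node holds ordered relationships to its airports, and a traversal checks that the origin precedes the destination:

```python
# Routes as their own nodes: each route holds (airport, order) pairs,
# standing in for relationships with an "order" property local to the route.
routes = {
    "ROUTE1": [("JFK", 1), ("LAX", 2), ("SAN", 3)],
    "ROUTE2": [("JFK", 1), ("LAX", 2)],
}

def flights_between(origin, dest):
    """Return routes on which `origin` appears before `dest` in stop order."""
    hits = []
    for route, stops in routes.items():
        order = {airport: n for airport, n in stops}
        if origin in order and dest in order and order[origin] < order[dest]:
            hits.append(route)
    return hits

print(flights_between("JFK", "SAN"))  # only ROUTE1 continues on to SAN
print(flights_between("JFK", "LAX"))  # both routes serve this leg
```

Because each flight is its own node rather than an array property on a relationship, 100 flights between JFK and LAX are just 100 small nodes, which graph databases handle comfortably at this scale.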
Check out Neo4J's contribution page
One of the winners of their contest was a gist describing US flights and airports; it is very well done.
These links may be useful: http://maxdemarzi.com/?s=flights and http://gist.neo4j.org/?6619085

Data structure(s) useful for finding values that can be identified with multiple keys of different types

As an example, consider the storage of hospital records. If John Smith is feeling sick, the doctor might need to look up his record by name to find his medical history. However, the doctor might also need to look up all patients who experienced the symptoms John experienced to help with the diagnosis. In another case, he may need a list of all patients admitted to the hospital at a certain time. What data structure(s) would be used to store patient records and search for them based on name, symptom, date of admission, and possibly other identifiers?
I'll throw this out there: this reads like the use-case for a relational database. Perhaps storing the data in a database and accessing it with queries is a good, long-term solution? If you're interested in the theory/algorithms, you can study how databases solve these problems. Things like indexes, query optimization, etc. are quite deep and probably can't be meaningfully covered here.
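As a minimal sketch of that approach, here is the example in SQLite with one index per lookup key. The schema is hypothetical; the point is that each of the doctor's questions becomes an indexed query rather than a bespoke data structure:

```python
import sqlite3

# Patients plus a symptoms table, with an index on each field we search by.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, admitted TEXT);
CREATE TABLE symptom (patient_id INTEGER REFERENCES patient(id), name TEXT);
CREATE INDEX idx_patient_name     ON patient(name);
CREATE INDEX idx_patient_admitted ON patient(admitted);
CREATE INDEX idx_symptom_name     ON symptom(name);
""")
cur.execute("INSERT INTO patient VALUES (1, 'John Smith', '2024-01-05')")
cur.execute("INSERT INTO patient VALUES (2, 'Jane Doe',   '2024-01-05')")
cur.executemany("INSERT INTO symptom VALUES (?, ?)",
                [(1, "fever"), (2, "fever"), (2, "cough")])

# Lookup by name:
by_name = cur.execute("SELECT id FROM patient WHERE name = ?",
                      ("John Smith",)).fetchall()
# Lookup by shared symptom:
by_symptom = cur.execute("""SELECT DISTINCT p.name FROM patient p
                            JOIN symptom s ON s.patient_id = p.id
                            WHERE s.name = ? ORDER BY p.name""",
                         ("fever",)).fetchall()
print(by_name, by_symptom)
```

Under the hood each index is typically a B-tree, which is the data-structure answer to the original question: one B-tree per search key over the same set of records.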

Matching unique ids in two different databases

I have two different databases that are not connected in any way. In fact, one is a public school database and one is a HUD (housing) database. By law they are not allowed to share names and other specific identifiers. Birthdates and addresses are okay, along with zip codes and other more general IDs. The users need to be able to query the other database to get non-specific information, so it would appear that they need to share the same unique ID. I was considering such things as using birthdates and perhaps initials of the name, or perhaps the last 4 digits of the SSN along with the birthdate. The client was thinking of global positioning data, but I'm concerned about apartments next to one another or families moving. Any ideas?
First you need to determine what your measure of uniqueness will be. If either database contains two different people who share the same value for that measure, you need to change your strategy. After that, put a constraint on both databases enforcing that these properties (birthdate, SSN) are what make a Person record unique.
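One way to realize such a shared ID without exchanging names: both systems derive the same opaque key from the permitted fields. This is a hypothetical sketch (field choices from the question above, function name invented), and its uniqueness must be verified on the real data first:

```python
import hashlib

def match_key(birthdate, ssn_last4, first_initial, last_initial):
    """Derive an opaque shared ID from non-prohibited fields.

    Both databases compute this independently, so records can be
    correlated without ever exchanging names or full identifiers.
    """
    raw = f"{birthdate}|{ssn_last4}|{first_initial.upper()}{last_initial.upper()}"
    return hashlib.sha256(raw.encode()).hexdigest()

# Same person, computed independently on each side:
school_id = match_key("1990-04-12", "1234", "j", "s")
hud_id    = match_key("1990-04-12", "1234", "J", "S")
print(school_id == hud_id)
```

Before relying on this, measure how often two distinct people collide on (birthdate, last-4 SSN, initials) in your actual data; if collisions occur, the measure of uniqueness has to change, per the answer above.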

Cube Design - Bridge Tables for Many To Many mapping with additional column

I am making a cube in SQL Server Analysis Services 2005 and have a question about many-to-many relationships.
I have a many to many relationship between two entities that contains an additional descriptive column as part of the relationship.
I understand that I may need to add a bridge table to model the relationship, but I am not sure where to store the additional column: in the bridge table or elsewhere?
A many-to-many relationship in SSAS can be implemented via an intermediate fact table that contains both dimension keys subject to the relation.
For example, if you have a cube with a book-sales fact table and you want to aggregate total sales by author (an author may have many books, and a book may be written by many authors), you should also have an author-book intermediate fact table (just like in the relational database world). In this bridge table, you should have both dimension keys (Author and Book) plus any measures related to the current book and author, such as wages paid to the author to write the book (or chapters).
As a result, if your additional column is a kind of measure, you should add it to the intermediate fact table.
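The bridge-table shape can be sketched in plain SQL (SQLite here, outside SSAS; table names are illustrative). The sales measure lives on the fact table, while the per-pair measure (wages) lives on the bridge row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE book_sales  (book_id INTEGER, amount REAL);
CREATE TABLE author_book (author_id INTEGER, book_id INTEGER, wages REAL);
""")
cur.executemany("INSERT INTO book_sales VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 80.0)])
# Book 1 has two authors, book 2 has one; wages live on the bridge row,
# since they belong to the (author, book) pair rather than either side alone.
cur.executemany("INSERT INTO author_book VALUES (?, ?, ?)",
                [(10, 1, 500.0), (11, 1, 300.0), (10, 2, 400.0)])

# Total sales by author, resolved through the bridge table:
rows = cur.execute("""
    SELECT ab.author_id, SUM(bs.amount)
    FROM author_book ab
    JOIN book_sales bs ON bs.book_id = ab.book_id
    GROUP BY ab.author_id ORDER BY ab.author_id
""").fetchall()
print(rows)
```

Note that per-author totals deliberately overlap on co-authored books (the two authors of book 1 each see its full 150.0); SSAS's many-to-many dimension support exists precisely to control this kind of double counting at the grand-total level.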
