Data structure(s) useful for finding values that can be identified with multiple keys of different types - lookup-tables

As an example, consider the storage of hospital records. If John Smith is feeling sick, the doctor might need to look up his record by name to find his medical history. However, the doctor might also need to look up all patients who experienced the same symptoms as John to help with the diagnosis. In another case, he may need a list of all patients admitted to the hospital at a certain time. What data structure(s) would be used to store patient records and search for them based on name, symptom, date of admission, and possibly other identifiers?

I'll throw this out there: this reads like the use-case for a relational database. Perhaps storing the data in a database and accessing it with queries is a good, long-term solution? If you're interested in the theory/algorithms, you can study how databases solve these problems. Things like indexes, query optimization, etc. are quite deep and probably can't be meaningfully covered here.
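To make that a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample patients are all made up for illustration; the point is only that one index per lookup key lets the same records be found by name, by symptom, or by admission date.

```python
import sqlite3

# In-memory database; the schema and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        admitted_on TEXT NOT NULL          -- ISO date string
    );
    CREATE TABLE patient_symptoms (
        patient_id INTEGER NOT NULL REFERENCES patients(id),
        symptom TEXT NOT NULL
    );
    -- One index per key we want to search by.
    CREATE INDEX idx_patients_name ON patients(name);
    CREATE INDEX idx_patients_admitted ON patients(admitted_on);
    CREATE INDEX idx_symptoms_symptom ON patient_symptoms(symptom);
""")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", [
    (1, "John Smith", "2024-03-01"),
    (2, "Jane Doe", "2024-03-01"),
    (3, "Ann Lee", "2024-03-05"),
])
conn.executemany("INSERT INTO patient_symptoms VALUES (?, ?)", [
    (1, "fever"), (1, "cough"), (2, "fever"), (3, "rash"),
])

def by_name(name):
    """Full records for patients with the given name."""
    return conn.execute(
        "SELECT id, name, admitted_on FROM patients WHERE name = ?", (name,)
    ).fetchall()

def by_symptom(symptom):
    """Names of all patients who reported the given symptom."""
    rows = conn.execute(
        "SELECT p.name FROM patients p "
        "JOIN patient_symptoms s ON s.patient_id = p.id "
        "WHERE s.symptom = ? ORDER BY p.name", (symptom,)
    )
    return [r[0] for r in rows]

def by_admission_date(day):
    """Names of all patients admitted on the given ISO date."""
    rows = conn.execute(
        "SELECT name FROM patients WHERE admitted_on = ? ORDER BY name", (day,)
    )
    return [r[0] for r in rows]
```

If you wanted to build this by hand instead, each index corresponds roughly to a separate map (hash table or B-tree) from one key type to record IDs, all pointing at the same underlying records.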

Related

Firebase / NoSQL - How to aggregate data for statistics

I'm creating my first-ever project with Firebase, and I've come to the point where I need some statistics based on user input. I know Firebase (and NoSQL databases in general) is not ideal for statistics, but it works for me in every other case, so I would like to give it a try.
What I have:
I'm working on an application where people can invite a friend to work for their company, so I have a collection of "referrals" where the ID of each referral is basically the UserID of the user to whom the referral belongs, and then there is a subcollection named "items" where the data is stored.
What my data looks like:
Each item has these fields:
applicant
appliedDate
position (includes positionId and the department the position belongs to)
status
What I want is to let users generate statistics based on:
date range
status
department
What I was thinking about:
It's probably not the best idea to let Firebase iterate over all referrals every time a user makes a request, as that may get really expensive. What I was thinking of is using Cloud Functions to update statistics whenever something changes, e.g. when a new applicant applies I will increase the total counter by one and do the same for the counter of the specific department. However, I feel like this may work for total numbers or for predefined queries, e.g. "LAST MONTH", but once I don't know what dates the user will select, it starts to get tricky.
Any idea how I can design something like this?
Thanks a lot!
What you're considering is the idiomatic approach to calculating aggregates in Firestore, and in most NoSQL databases. If you follow this pattern, Firestore is quite well suited to storing statistics.
It's ad-hoc statistics, like the unknown date range, that are trickier. Usually this comes down to storing the right values so that you no longer need to read an unknown number of documents to calculate a value.
For example, if you store counters for the statistics per month, week, day and hour, you can satisfy a wide range of date ranges with a limited number of read operations. You may need to read multiple documents, but the number of documents to read depends on the range, and not on the total number of documents in the database.
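As a rough illustration of that idea, here is a sketch in plain Python that keeps only monthly and daily counters per department and covers an arbitrary date range with as few buckets as possible. The `counters` object stands in for counter documents, and all function names, bucket IDs, and the "engineering"/"sales" departments are assumptions made for this example, not Firestore API.

```python
from collections import Counter
from datetime import date, timedelta

# Stand-in for counter documents; each key plays the role of a document ID.
counters = Counter()

def record_referral(department: str, applied: date) -> None:
    """Bump every bucket the new item falls into (the job of a write trigger)."""
    counters[(department, applied.strftime("%Y-%m"))] += 1  # monthly bucket
    counters[(department, applied.isoformat())] += 1        # daily bucket

def buckets_for_range(start: date, end: date) -> list[str]:
    """Cover [start, end] inclusive with whole months where possible, days otherwise."""
    keys = []
    day = start
    while day <= end:
        # First day of the month after the one `day` is in.
        next_month = date(day.year + (day.month == 12), day.month % 12 + 1, 1)
        if day.day == 1 and next_month - timedelta(days=1) <= end:
            keys.append(day.strftime("%Y-%m"))  # whole month fits in the range
            day = next_month
        else:
            keys.append(day.isoformat())        # fall back to a single day
            day += timedelta(days=1)
    return keys

def count_in_range(department: str, start: date, end: date) -> int:
    """Sum the counters of the buckets covering the range."""
    return sum(counters[(department, b)] for b in buckets_for_range(start, end))

record_referral("engineering", date(2024, 1, 29))
record_referral("engineering", date(2024, 1, 31))
record_referral("engineering", date(2024, 2, 10))
record_referral("engineering", date(2024, 2, 20))
record_referral("sales", date(2024, 2, 21))
record_referral("engineering", date(2024, 3, 2))
record_referral("engineering", date(2024, 3, 5))

total = count_in_range("engineering", date(2024, 1, 30), date(2024, 3, 2))
```

Here the range from 2024-01-30 to 2024-03-02 is answered from five buckets (two days, one month, two days) instead of 33 daily reads. Adding weekly or hourly buckets follows the same pattern. A status dimension is harder, because a status change has to decrement the old counter as well as increment the new one.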
Of course, for the most flexible ad-hoc querying, you may still want to consider another solution, such as BigQuery, which was made precisely for this use-case.

Basic firebase database structure design decision

Situation: In the app we have up to 1000 schools. Every school has students, and students have lessons and join events (and more). We need to query lessons per student, per school, and per date quickly and often. We have 2 designs in mind and are wondering which is the best way to proceed.
1 - design with dedicated school node
2 - design with no dedicated school node
Examples of two designs
PRO design 1
- root ref points to the school of the user after login, so no need to query on school IDs
- no need to mention school IDs everywhere
- no need for separate "lessons per school" and "events per school" nodes
- rules on school level
...
PRO design 2
- flatter data, as widely advised on the internet
For most NoSQL database structures, flattening and denormalising data is the best method. And that is exactly the case with Firebase too.
When you flatten your data, you get the following advantages:
You're mostly downloading only the minimum required amount of data. That leads to efficiency and cost-effectiveness.
Your downloads are much faster, especially compared to the likes of SQL join queries.
Having said that, in your particular case, I think that it really depends on how much the school affects the logged in user.
Suppose that a school is only an attribute for a student, and serves no other purpose, then the second database is the way to go. For example, if the books a student can get are independent of the school she goes to, then the second database style is more suited.
However, if a school categorizes students into groups that define their interaction with the database, then the first database structure is the way to go. An example of this is that a student can only get a book when it's available in the school she goes to.
Regardless of your decision, I would like to commend you on the fact that you have flattened your database quite well in both your structures! And my personal suggestion would be to go with the one that is more convenient to code, read and maintain for you.

Structuring Data In Firebase

I'm contemplating using Firebase for an upcoming project, but before doing so I want to make sure I can develop a data structure that will meet my purposes. I'm interested in tracking horse race results for approximately 25 racetracks across the US. My initial impression was that my use case aligned nicely with the Firebase Weather Data Set. The Weather data set is organized by city and then in various time series: currently, hourly, daily and minutely.
My initial thought was that I could follow a similar approach and use the 25 tracks as cities and then organize by years, months, days and races.
This structure lends itself nicely to accessing data from a particular track, but suppose that I also want to access data across all tracks. For example, access data for all tracks for races that occurred in 2014 and had more than 10 horses.
Questions:
Does my proposed data structure limit me to queries by track only, or would I still be able to query across tracks, years, days, months, etc. and incorporate any and all of the various metadata attributes: number of horses, distance of race, etc.?
Given my interest in freeform querying, is there another data structure that would be more advantageous?
Is Firebase similar to MongoDB in having issues with collections (lists) that grow, or can one continue to push to a list without pre-allocating or worrying about sharding?
I believe my confusion stems from the URL/path nature of the data storage.
EDIT:
Here is a sample of what I had in mind:
Thanks for your input.
I would think that you would want to organize by horse first. I guess it depends what you are deriving from the data.
One horse could be at different tracks.
Horses table
* Horsename
-----Date
-----Track
-----Racenumber
-----Gate
-----Jockey
-----Place
-----Odds
-----Mud?
Races table
----Track
----Racenumber
----Date
----Time
----NumberOfHorses
Link the tables and you could get at any one part of it.
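Roughly what I mean by linking the tables, sketched with Python's sqlite3 module rather than Firebase. The two tables mirror the lists above (with fewer columns), and the track names, horse names, and results are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE races (
        track TEXT, race_date TEXT, race_number INTEGER,
        number_of_horses INTEGER,
        PRIMARY KEY (track, race_date, race_number)
    );
    -- One row per horse per race: the "Horses table" above.
    CREATE TABLE entries (
        horse_name TEXT, track TEXT, race_date TEXT, race_number INTEGER,
        place INTEGER,
        FOREIGN KEY (track, race_date, race_number)
            REFERENCES races (track, race_date, race_number)
    );
""")
conn.executemany("INSERT INTO races VALUES (?, ?, ?, ?)", [
    ("Track A", "2014-05-03", 1, 12),
    ("Track B", "2014-06-14", 4, 8),
    ("Track B", "2014-07-19", 2, 11),
    ("Track A", "2015-05-02", 1, 14),
])
conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?)", [
    ("Comet", "Track A", "2014-05-03", 1, 2),
    ("Comet", "Track B", "2014-07-19", 2, 1),
    ("Dusty", "Track B", "2014-06-14", 4, 3),
])

def big_races(year, min_horses):
    """Races across all tracks in `year` with more than `min_horses` runners."""
    return conn.execute(
        "SELECT track, race_date, race_number FROM races "
        "WHERE race_date >= ? AND race_date < ? AND number_of_horses > ? "
        "ORDER BY race_date, track",
        (f"{year}-01-01", f"{year + 1}-01-01", min_horses),
    ).fetchall()

def horse_history(horse_name):
    """Every race a horse ran, at any track, joined with the race details."""
    return conn.execute(
        "SELECT e.race_date, e.track, e.place, r.number_of_horses "
        "FROM entries e JOIN races r "
        "ON r.track = e.track AND r.race_date = e.race_date "
        "AND r.race_number = e.race_number "
        "WHERE e.horse_name = ? ORDER BY e.race_date",
        (horse_name,),
    ).fetchall()
```

With this layout, `big_races(2014, 10)` answers the "all tracks, 2014, more than 10 horses" question without walking a per-track tree, and `horse_history` follows one horse across tracks.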

Node import and performance question

For one of my clients I have to import a CSV of Medicare plans provided by the government (part one provided here) into Drupal 7. There are about 500,000 rows of data in that CSV, most of which differ only by the FIPS County code field - basically, every county that a plan is available in counts as one row.
Should I import all 500k rows into Drupal 7 as individual nodes, or create a single node for every plan and put the numerous FIPS codes associated with that plan in a multi-value text field? I opted for the latter route to begin with, however when I looked in the plan database it looks like some plans are available in more than 10,000 counties. I'd like to find the most efficient, Drupal-esque solution to storing all these plans and where they are available.
Generally it is very useful to avoid storing any duplicate data, so you are right: creating 500k rows as individual nodes is a bad idea. I would rather create two content types (using CCK):
Medicare Plan
FIPS County code (or maybe just County)
And then create a many-to-many relationship between them (using CCK Node Reference, maybe Corresponding node references for mutual relationships if needed).
You can then create a view that will list all FIPS County codes attached to a particular Medicare Plan.
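Independent of Drupal, the shape of that many-to-many relationship can be sketched in a few lines of Python. The column names (`plan_id`, `plan_name`, `fips`) and the sample rows are assumptions for illustration; the real CSV will have different headers.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the "one row per plan per county" shape.
SAMPLE_CSV = """plan_id,plan_name,fips
H0001,Example Plan A,01001
H0001,Example Plan A,01003
H0002,Example Plan B,01001
"""

plans = {}                         # plan_id -> plan name (one entry per plan)
plan_counties = defaultdict(set)   # plan_id -> FIPS codes where it is offered
county_plans = defaultdict(set)    # FIPS code -> plan_ids offered there

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    plans[row["plan_id"]] = row["plan_name"]
    plan_counties[row["plan_id"]].add(row["fips"])
    county_plans[row["fips"]].add(row["plan_id"])
```

Three CSV rows collapse into two plans, and both directions of the relationship (counties for a plan, plans for a county) stay cheap to look up. This only holds if rows for the same plan really are identical apart from the county code.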
I ended up going with a row per plan - as it turned out, there were subtle differences between them that I missed. Thanks to all who answered!

Data Warehouse Design Question

In my OLTP database I have a layout consisting of instructors and students. Each student can be a student of any number of instructors. A student can also sign up for an instructor, but not necessarily book any tuition (lesson).
In a data warehouse, how best would this be modelled? If I create a dimension table for Lessons, Instructors and Students and a fact table for the lessons students have taken then this will work when an instructor wants to see what lessons a student has taken.
However, how will an instructor see how many students are REGISTERED with the instructor but have not yet taken a lesson?
In my OLTP, I have a many-to-many table (InstructorStudents) that links each student with one or more instructors. In an OLAP database, this isn't appropriate.
What would be the best schema in this case? Would a many to many be appropriate in this instance? I can't store a list of which students are registered to which instructors in the student table, so I feel another dimension table is necessary but cannot work out what should be contained in it.
If a fact represents a transaction, you seem to have two different facts here: Sign ups & Lessons. There are always a lot of ways to go but, perhaps, you need two fact tables. They may have similar dimensionality except the sign-up table will have a Class dimension (class name, instructor name, etc.). The Lessons table will tie to the class dimension but, also, to a Lesson dimension (date, classroom used, etc.).
There are a few other ways to do this but they will be more difficult from a programming & reporting perspective.
You need a many to many dimensional model.
You need a factless fact table. The following resource describes an example close to your need:
http://www.kimballgroup.com/1996/09/02/factless-fact-tables/
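For what it's worth, here is a small sketch of how a factless fact table could answer the "registered but no lesson yet" question, using Python's sqlite3 module. All table and column names and the sample rows are hypothetical; a real warehouse would add date and other dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_instructor (instructor_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, name TEXT);
    -- Factless fact table: one row per sign-up, dimension keys only, no measures.
    CREATE TABLE fact_registration (
        instructor_key INTEGER REFERENCES dim_instructor (instructor_key),
        student_key INTEGER REFERENCES dim_student (student_key),
        PRIMARY KEY (instructor_key, student_key)
    );
    -- Ordinary fact table: one row per lesson taken, with a measure.
    CREATE TABLE fact_lesson (
        instructor_key INTEGER REFERENCES dim_instructor (instructor_key),
        student_key INTEGER REFERENCES dim_student (student_key),
        lesson_date TEXT,
        duration_minutes INTEGER
    );
""")
conn.executemany("INSERT INTO dim_instructor VALUES (?, ?)",
                 [(1, "Instructor A"), (2, "Instructor B")])
conn.executemany("INSERT INTO dim_student VALUES (?, ?)",
                 [(1, "Student X"), (2, "Student Y"), (3, "Student Z")])
conn.executemany("INSERT INTO fact_registration VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2), (2, 3)])
conn.executemany("INSERT INTO fact_lesson VALUES (?, ?, ?, ?)",
                 [(1, 1, "2024-01-10", 60), (2, 2, "2024-01-11", 45)])

def registered_count(instructor_key):
    """Number of students signed up with the instructor."""
    return conn.execute(
        "SELECT COUNT(*) FROM fact_registration WHERE instructor_key = ?",
        (instructor_key,),
    ).fetchone()[0]

def registered_without_lessons(instructor_key):
    """Students signed up with the instructor who have no lesson rows yet."""
    rows = conn.execute(
        "SELECT s.name FROM fact_registration r "
        "JOIN dim_student s ON s.student_key = r.student_key "
        "WHERE r.instructor_key = ? AND NOT EXISTS ("
        "  SELECT 1 FROM fact_lesson l "
        "  WHERE l.instructor_key = r.instructor_key "
        "  AND l.student_key = r.student_key) "
        "ORDER BY s.name",
        (instructor_key,),
    )
    return [r[0] for r in rows]
```

The registration table records that the event happened and nothing else, which is what makes it "factless". Counting its rows gives registrations, and comparing it against the lesson fact table gives the students who signed up but never booked.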
