How to design a physics problem database?

I have both problems and solutions to over twenty years of physics PhD qualifying exams that I would like to make more accessible, searchable, and useful.
The problems on the Quals are organized into several different categories. The first category is Undergraduate or Graduate problems. (The first day of the exam is Undergraduate, the second day is Graduate). Within those categories there are several subjects that are tested: Mechanics, Electricity & Magnetism, Statistical Mechanics, Quantum Mechanics, Mathematical Methods, and Miscellaneous. Other identifying features: Year, Season, and Problem number.
I'm specifically interested in designing a web-based database system that can store the problem and solution and all the identifying pieces of information in some way so that the following types of actions could be done.
Search and return all Electricity & Magnetism problems.
Search and return all graduate Statistical Mechanics problems.
Create a random qualifying exam: a new 20-question test randomly picking 2 Undergrad Mechanics problems, 2 Undergrad E&M problems, etc. from past qualifying exams (over some restricted date range).
Have the option to hide or display the solutions on results.
Any suggestions or comments on how best to do this project would be greatly appreciated!
I've written up some more details here if you're interested.

For your situation, implementing the interface seems to be the more important part; the data storage itself is straightforward. You can use a database table (or tags), where each record in the database (or each tag) has the following properties (a schema sketch follows the list):
Year
Season
Undergraduate or Graduate
Subject: CM, EM, QM, SM, Mathematical Methods, and Miscellaneous
Problem number (is it necessary?)
Question
Answer
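For concreteness, here is a minimal sketch of such a table using Python's built-in sqlite3 module; the file, table, and column names are only illustrative assumptions, not something prescribed above.

    import sqlite3

    # Single-table design: one row per problem, with its identifying metadata.
    conn = sqlite3.connect("quals.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS problems (
            id             INTEGER PRIMARY KEY,
            year           INTEGER NOT NULL,
            season         TEXT    NOT NULL,   -- e.g. 'Spring' or 'Fall'
            level          TEXT    NOT NULL,   -- 'Undergraduate' or 'Graduate'
            subject        TEXT    NOT NULL,   -- 'CM', 'EM', 'QM', 'SM', 'Math Methods', 'Misc'
            problem_number INTEGER,            -- optional
            question       TEXT    NOT NULL,
            answer         TEXT
        )
    """)
    conn.commit()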
Search and return all Electricity & Magnetism problems.
Query the database directly; you will get back an array of matching records, and you can then display some or all of the questions.
Create a random qualifying exam: a new 20-question test randomly picking 2 Undergrad Mechanics problems, 2 Undergrad E&M problems, etc. from past qualifying exams (over some restricted date range).
To generate a random exam, first specify the number of questions for each category and the year range they are drawn from. For example, if you want 2 undergraduate E&M questions, query the database for all undergraduate E&M questions in that range, shuffle the resulting array, select the first two, and display them to the student. Repeat for the other categories and you will have a complete random exam paper.
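A minimal sketch of that procedure, reusing the hypothetical problems table above; the blueprint of categories, counts, and years is just an example.

    import random
    import sqlite3

    conn = sqlite3.connect("quals.db")

    def pick(level, subject, n, year_from, year_to):
        """Return n random (question, answer) rows for one category and date range."""
        rows = conn.execute(
            "SELECT question, answer FROM problems "
            "WHERE level = ? AND subject = ? AND year BETWEEN ? AND ?",
            (level, subject, year_from, year_to),
        ).fetchall()
        return random.sample(rows, min(n, len(rows)))

    # Example blueprint: 2 problems per (level, subject) pair, drawn from 2000-2015,
    # giving a 20-question exam for 5 subjects at 2 levels.
    subjects = ("CM", "EM", "QM", "SM", "Math Methods")
    blueprint = [(level, s, 2) for level in ("Undergraduate", "Graduate") for s in subjects]

    exam = []
    for level, subject, n in blueprint:
        exam.extend(pick(level, subject, n, 2000, 2015))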
Have the option to hide or display the solutions on results.
It is up to you to decide whether students should see the answers; this can be controlled by a single flag (one variable) when rendering the results.

Are "Electricity & Magnetism" and "Statistical Mechanics" mutually exclusive categoriztions, along the same dimension? Are there multiple dimensions in categories you want to search for?
If the answer is yes to both, then I would suggest you look into multidimensional data modeling. As a physicist, you've got a leg up on most people when it comes to evaluating the number of dimensions to the problem. Analyzing reality in a multidimensional way is one of the things physicists do.
Sometimes obtaining and learning a dedicated multidimensional database (MDDB) tool is overkill. Once you've looked into multidimensional modeling, you may decide you like the modeling concept but still want to implement it using relational databases with a SQL interface.
In that case, the next thing to look into is star schema design. Star schema is quite different from normalization as a design principle, and it doesn't have the same advantages and limitations. But it's worth knowing for the case where the problem really is a multidimensional one.
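To make that concrete, here is a rough, hypothetical sketch of what a star schema could look like for this data, again expressed through Python's sqlite3: the problem text sits in a central fact table and each search dimension becomes its own small dimension table. All names are made up for illustration.

    import sqlite3

    conn = sqlite3.connect("quals_star.db")
    conn.executescript("""
        -- Dimension tables: one per search axis.
        CREATE TABLE IF NOT EXISTS dim_exam    (exam_id    INTEGER PRIMARY KEY, year INTEGER, season TEXT);
        CREATE TABLE IF NOT EXISTS dim_level   (level_id   INTEGER PRIMARY KEY, level TEXT);    -- UG / Grad
        CREATE TABLE IF NOT EXISTS dim_subject (subject_id INTEGER PRIMARY KEY, subject TEXT);  -- CM, EM, ...

        -- Fact table: the problems themselves, keyed by the dimensions.
        CREATE TABLE IF NOT EXISTS fact_problem (
            problem_id     INTEGER PRIMARY KEY,
            exam_id        INTEGER REFERENCES dim_exam(exam_id),
            level_id       INTEGER REFERENCES dim_level(level_id),
            subject_id     INTEGER REFERENCES dim_subject(subject_id),
            problem_number INTEGER,
            question       TEXT,
            answer         TEXT
        );
    """)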

Related

Gremlin: Shortest logistical route between A and B while respecting schedules + other constraints

Preface
I'm new to Gremlin and working through Kelvin Lawrence's awesome eBook on the topic in order to solve a specific use-case.
Due to the sheer amount to learn, I'm asking this question to get recommendations on how I might approach the challenge so that, as I read the eBook, I'll better know the sections to which to pay extra attention.
I intend to use AWS Neptune in the pursuit of solving this, so I tagged that topic as well.
Question
Respecting departure/arrival times of legs + other constraints, can the shortest path (the real-world, logistical meaning of "path") between origin and destination be "queried" (i.e., can I use the Gremlin console with a single statement)? Or is the use-case of such complexity that I will effectively need to write a program to accomplish it?
Use-Case / Detail
I hope to answer the question:
Starting at ORIGIN on DAY, can I get to DESTINATION while respecting [CONDITIONS]?
The good news is that I only need a true/false response (so limit(1)?) and a lack of a result (e.g., []) suffices for "no".
What are the conditions?
Flight schedules need to be respected. Instead of simple flight routes (i.e., a connection exists between BOSton and DALlas), I have actual flight schedules (i.e., on Wednesday, 9 Nov 2022 at 08:40, flight XYZ will depart BOSton and then arrive in DALlas at 13:15). Consequently, if/when there are connections, I need to respect arrival and departure times plus some sort of buffer (i.e., a path on which a traveler would arrive at 13:05 and depart on another leg at 13:06 isn't actually a valid path);
Aggregate travel time / cost limits. The answer to the question needs to be "No" if a path's aggregate travel time or aggregate cost exceeds specified limits. (Here, I believe I'll need to use sack() to track the cost - financial and time - of each leg and bail out of the repeat()...until() loop when either limit is hit?)
I apologize b/c I know this isn't a good StackOverflow question, since it's not technically specific -- my hope is that, at least, some specific technical recommendations might result.
The use-case seems like the varsity / pro version of the flight routes example presented in the eBook, which is perfect for someone brand-new to Gremlin ... 😅
There are a number of ways you might model this. One approach I have seen used effectively is to essentially have two graphs. The first just knows about routes; you use it to find ways to get from A to Z in x hops. Then, using the results from that first search, you query the second graph, which tracks actual flights, for flights within the time constraints you need to impose. So there is really a data modeling question and then a query writing part. Obviously the data model should enable the queries to be as efficient as possible.
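To illustrate just the first, route-only search, here is a hedged sketch using gremlinpython; the 'airport'/'route' labels follow the air-routes dataset used in Kelvin Lawrence's eBook, and the endpoint URL is a placeholder, not a real Neptune cluster.

    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.process.graph_traversal import __
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    # Placeholder endpoint; replace with your Neptune cluster's websocket URL.
    conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
    g = traversal().withRemote(conn)

    # Route-only search: is there any simple path of route edges from BOS to DAL?
    paths = (
        g.V().has("airport", "code", "BOS")
         .repeat(__.out("route").simplePath())
         .until(__.has("code", "DAL"))
         .path().by("code")
         .limit(1)
         .toList()
    )
    print(paths)  # non-empty list => at least one route exists
    conn.close()

The schedule and buffer constraints would then be checked against the second, flight-level graph (or handled with sack()/where() steps), which is where most of the query-writing effort goes.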
There are a couple of useful blog posts related to your question. They mention Neo4j but are really quite generic and mainly focus on the data modeling aspects of your question.
https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/
https://maxdemarzi.com/2017/05/24/flight-search-with-neo4j/
I would focus on the data model, and once you have that, focus on the Gremlin queries. Amazon Neptune also now supports openCypher as an alternative property graph query language.
If you already have a data model worked out and can share a sample, I'm happy to update the answer with an example query or two.

Market Basket Analysis - single model for variable number of features?

I am using Apriori to build a recommender system to go along with my company's application. Before going down this road, I'd like to confirm with someone that has more experience that I am on the right track. Any help is appreciated.
Let me try to explain the issue. Depending on the context of the user within the application, the features that impact the recommendations can vary. For example, imagine a shopping scenario. If I shop at HEB, I usually have a predefined grocery list, so the items on that list would be good recommendations if I just told the app I was going to HEB. When I go to Home Depot though, I tend to shop by department, so power tools and the associated parts are good recommendations if I tell the app I'm at Home Depot and shopping for power tools.
You see that the number of features varies in the two scenarios. In the first, my recommendations depend solely on the store while in the second, they depend on the store and the department in which I'm shopping.
I am looking to use a single Apriori model that can handle this type of situation. Would that be considered a best practice or is it better to have different models, one for when we just list the store and another for when we list the store and the department? Given that Apriori is an unsupervised algorithm, I think it can be done with one model, but wanted to double check since I don't have a ton of experience.
It seems to me like you are talking about multi-level association rules. This is from the manual page of the aggregate function in arules:
Support for Item Hierarchies
Description:
Often an item hierarchy is available for datasets used for association rule mining. For example in a supermarket dataset items like "bread" and "beagle" might belong to the item group (category) "baked goods."
I guess the higher-level categories would be your departments and stores. With an item hierarchy in place, the algorithm can find associations between items, departments, and stores.
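The same multi-level idea can be sketched in Python (the answer above refers to R's arules): if each transaction also carries its store and department as extra "items", Apriori will surface rules that mix levels. This uses the mlxtend library; the transactions and thresholds are made up, and the call signature may vary slightly between mlxtend versions.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Each transaction includes its higher-level context (store, department) as pseudo-items.
    transactions = [
        ["store=HEB", "milk", "bread", "eggs"],
        ["store=HEB", "milk", "tortillas"],
        ["store=HomeDepot", "dept=PowerTools", "drill", "drill bits"],
        ["store=HomeDepot", "dept=PowerTools", "circular saw", "saw blades"],
    ]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    frequent = apriori(onehot, min_support=0.25, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.5)

    # Rules like {store=HomeDepot, dept=PowerTools} -> {drill} now appear
    # alongside plain item-item rules.
    print(rules[["antecedents", "consequents", "support", "confidence"]])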

Categorical Clustering of Users Reading Habits

I have a data set with a set of users and a history of documents they have read, all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Anyone have any advice on how to do this especially in R, like specific packages? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and the choice of distance metric, so no single result is trustworthy, and most results you get out are outright useless. It is about as reliable as a proof by example.
Clustering should only be used for exploratory analysis, i.e. to find patterns that you then need to study with other methods.
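That said, for the exploratory step one common workaround for the categorical issue is to summarize each user's history as a vector of proportions over the 7 attribute values, which is numeric and can be clustered directly. Here is a minimal sketch in Python (the question asks about R, so treat this purely as an illustration of the representation; the click log is made up):

    import pandas as pd
    from sklearn.cluster import KMeans

    # Hypothetical click log: one row per (user, document), with the document's
    # categorical attribute (7 possible values in the real data).
    clicks = pd.DataFrame({
        "user":     ["u1", "u1", "u1", "u2", "u2", "u3"],
        "category": ["politics", "politics", "sports", "sports", "sports", "science"],
    })

    # Each user becomes a vector of per-category proportions.
    profiles = pd.crosstab(clicks["user"], clicks["category"], normalize="index")

    # On the real data you would try k = 7 to probe the seven-cluster hypothesis.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
    print(dict(zip(profiles.index, labels)))

In R, the same proportion-table-plus-kmeans approach works, or you can look at Gower distance (cluster::daisy) with pam for genuinely categorical features.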

Why does HyperLogLog work, and for which real-world problems?

I know how HyperLogLog works, but I want to understand in which real-world situations it really applies, i.e., where it makes sense to use HyperLogLog and why. If you've used it to solve any real-world problems, please share. What I am looking for is: given HyperLogLog's standard error, in which real-world applications is it actually used today, and why does it work there?
("Applications for cardinality estimation", too broad? I would like to add this simply as a comment but it won't fit).
I would suggest you turn to the numerous academic studies of the subject; academic papers usually contain some information on "prior research on the subject" as well as "applications for which the subject has been used". You could start by traversing the references of interest cited in the following article:
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, by P. Flajolet et al.
... This problem has received a great deal of attention over the past two decades, finding an ever growing number of applications in networking and traffic monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service), and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various time slices is an important indication for detecting attacks and monitoring traffic, as it records the number of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of different connections, and though they may well pass unnoticed amongst a huge traffic, their activity becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]). Other applications of cardinality estimators include data mining of massive data sets of sorts—natural language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators.
At my work, HyperLogLog is used to estimate the number of unique users or unique devices hitting different code paths in online services. For example, how many users are affected by each type of service error? How many users use each feature? There are MANY interesting questions HyperLogLog allows us to answer.
Stack Overflow might use HyperLogLog to count the views of each question: it wants to make sure that one user can contribute only one view per question, so every counted view must be unique.
It could be implemented with a set; every question would have a set that stores the usernames:
question#ID121e={username1,username2...}
Creating a set for each question takes up space, and consider how many questions have been asked on this platform: the total amount of memory needed to keep track of every viewing user would be huge. HyperLogLog, by contrast, uses about 12 kB of memory per key no matter how many usernames are added, even for 10 million views.
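As a concrete illustration of that trade-off, here is a small sketch using Redis's built-in HyperLogLog commands through redis-py (the ~12 kB-per-key figure above is from the Redis implementation); the key name and usernames are made up.

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost

    # PFADD only updates the estimator's registers; it never stores the usernames.
    r.pfadd("question:121:views", "username1", "username2", "username1")

    # Approximate number of distinct viewers (standard error around 0.81%).
    print(r.pfcount("question:121:views"))  # -> 2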

Google prediction API - Building classifier training data

EDIT: I'm trying to classify new user reviews into a predefined set of tags. Each review can have multiple tags associated with it.
I've mapped my DB user reviews to 15 categories. The following example shows the review text and the reasoning behind the mapped categories:
USER_REVIEWS | CATEGORIES
"Best pizza ever, we really loved this place, our kids ..." | "food,family"
"The ATV tour was extreme and the nature was beautiful ..." | "active,family"
pizza:food
our kids:family
The ATV tour was extreme:active
nature was beautiful:nature
EDIT:
I tried 2 approaches of training data:
The first includes all categories in a single file like so:
"food","Best pizza ever, we really loved this place, our kids..."
"family","Best pizza ever, we really loved this place, our kids..."
The second approach was splitting the training data to 15 separate files like so:
family_training_data.csv:
"true" , "Best pizza ever, we really loved this place, our kids..."
"false" , "The ATV tour was extreme and the nature was beautiful ..."
None of the above was conclusive; both approaches missed tags most of the time.
Here are some questions that came up, while I was experimenting:
Some of my reviews are very long (more than 300 words). Should I limit the number of words in my training data so it matches the average review word count (80)?
Is it better to separate the data into 15 training data files with a TRUE/FALSE label (i.e., "is this review text of a specific category?"), or to mix all categories in one training data file?
How can I train the model to find synonyms or related keywords, so it can tag "The motorbike ride was great" as active even though the training data only had a record for an ATV ride?
I've tried the approaches described above without any good results.
Q: What training data format would give the best results?
I'll start with the parts I can answer with the given information. Maybe we can refine your questions from there.
Question 3: You can't train a model to recognize a new vocabulary word without supporting context. It's not just that "motorbike" is not in the training set, but that "ride" is not in the training set either, and the other words in the review do not relate to transportation. The cognitive information you seek is simply not in the data you present.
Question 2: This depends on the training method you're considering. You can give each tag its own feature column with a true/false value. This is functionally equivalent to 15 separate data files, each with a single true/false value. The one-file method gives you the chance to later extend to some context support between categories.
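To make the one-file layout concrete, here is a hedged sketch that turns multi-tagged reviews into one row per review with a true/false column per tag, using scikit-learn's MultiLabelBinarizer; the exact format your classifier ingests may differ, so treat this only as an illustration of the shape of the data.

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    reviews = [
        ("Best pizza ever, we really loved this place, our kids ...", ["food", "family"]),
        ("The ATV tour was extreme and the nature was beautiful ...", ["active", "family", "nature"]),
    ]

    texts = [text for text, _ in reviews]
    mlb = MultiLabelBinarizer()
    labels = mlb.fit_transform([tags for _, tags in reviews])

    # One row per review, one true/false column per tag: the "single file" layout.
    table = pd.DataFrame(labels.astype(bool), columns=mlb.classes_)
    table.insert(0, "text", texts)
    print(table)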
Question 1: The length, itself, is not particularly relevant, except that cutting out unproductive words will help focus the training -- you won't get nearly as many spurious classifications from incidental correlations. Do you have a way to reduce the size programmatically? Can you apply that to the new input you want to classify? If not, then I'm not sure it's worth the effort.
OPEN ISSUES
What empirical evidence do you have that 80% accuracy is possible with the given data? If the training data do not contain the theoretical information needed to accurately tag that data, then you have no chance to get the model you want.
Does your chosen application have enough intelligence to break the review into words? Is there any cognizance of word order or semantics -- and do you need that?
After facing similar problems, here are my insights regarding your questions:
According to the Watson Natural Language Classifier documentation it is best to limit the length of input text to fewer than 60 words, so I guess trimming long reviews down to your 80-word average (or below) will produce better results.
You can go either way, but separate files will produce less ambiguous results.
Creating a synonym graph, as suggested, would be a good place to start; Watson is aimed at more complex cognitive solutions.
Some other helpful tips from the Watson guidelines:
Limit the length of input text to fewer than 60 words.
Limit the number of classes to several hundred classes. Support for larger numbers of classes might be included in later versions of the service.
When each text record has only one class, make sure that each class is matched with at least 5 - 10 records to provide enough training on that class.
It can be difficult to decide whether to include multiple classes for a text. Two common reasons drive multiple classes: when the text is vague, identifying a single class is not always clear; and when experts interpret the text in different ways, multiple classes support those interpretations.
However, if many texts in your training data include multiple classes, or if some texts have more than three classes, you might need to refine the classes. For example, review whether the classes are hierarchical. If they are hierarchical, include the leaf node as the class.
