I have a data set with a set of users and a history of documents they have read; all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Does anyone have advice on how to do this, especially in R (specific packages)? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and the choice of distance metric, so no single result is trustworthy on its own; most of the results you get out are outright useless. It is about as reliable as a proof by example.
It should only be used for exploratory analysis, i.e. to find patterns that you then need to study with other methods.
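If you do go down the exploratory route, k-modes (rather than k-means) handles categorical data directly; in R, the klaR package's kmodes function is one place to look. Below is a minimal sketch using the Python kmodes package instead, purely to illustrate the idea; the data frame, its columns, and the toy values are invented stand-ins for a per-user summary of the reading history, not the asker's real data.

```python
# Exploratory k-modes clustering of users on categorical reading-history features.
# Assumes the third-party Python "kmodes" package (pip install kmodes).
import pandas as pd
from kmodes.kmodes import KModes

users = pd.DataFrame({
    "dominant_topic":   ["politics", "sports", "politics", "tech"],
    "dominant_country": ["US",       "UK",     "US",       "DE"],
    "dominant_author":  ["a1",       "a2",     "a1",       "a3"],
})

# n_clusters=2 only because the toy frame has 4 rows; with the real data you would
# use 7 to probe the seven-cluster hypothesis.
km = KModes(n_clusters=2, init="Huang", n_init=5)
labels = km.fit_predict(users)

print(labels)                  # cluster assignment per user
print(km.cluster_centroids_)   # modal category value per cluster and column
```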
EDIT: I'm trying to classify new user reviews into a predefined set of tags. Each review can have multiple tags associated with it.
I've mapped my DB user reviews to 15 categories. The following example shows the review text, the mapped categories, and the reasoning behind the mapping:
USER_REVIEWS | CATEGORIES
"Best pizza ever, we really loved this place, our kids ..." | "food,family"
"The ATV tour was extreme and the nature was beautiful ..." | "active,family"
pizza:food
our kids:family
The ATV tour was extreme:active
nature was beautiful:nature
EDIT:
I tried 2 approaches to formatting the training data:
The first includes all categories in a single file like so:
"food","Best pizza ever, we really loved this place, our kids..."
"family","Best pizza ever, we really loved this place, our kids..."
The second approach was splitting the training data to 15 separate files like so:
family_training_data.csv:
"true" , "Best pizza ever, we really loved this place, our kids..."
"false" , "The ATV tour was extreme and the nature was beautiful ..."
None of the above was conclusive, and the tagging was missed most of the time.
Here are some questions that came up, while I was experimenting:
Some of my reviews are very long (more than 300 words). Should I limit the number of words in my training data file so it matches the average review word count (80)?
Is it best to separate the data into 15 training data files, each with a TRUE/FALSE option (i.e. "is the review text of a specific category?"), or to mix all categories in one training data file?
How can I train the model to find synonyms or related keywords, so it can tag "The motorbike ride was great" as active even though the training data only had a record for an ATV ride?
I've tried the approaches described above, without any good results.
Q: What training data format would give the best results?
I'll start with the parts I can answer with the given information. Maybe we can refine your questions from there.
Question 3: You can't train a model to recognize a new vocabulary word without supporting context. It's not just that "motorbike" is not in the training set, but that "ride" is not in the training set either, and the other words in the review do not relate to transportation. The cognitive information you seek is simply not in the data you present.
Question 2: This depends on the training method you're considering. You can give each tag a separate feature column with a true/false value. This is functionally equivalent to 15 separate data files, each with a single true/false value. The one-file method gives you the chance to later extend to some context support between categories.
Question 1: The length, itself, is not particularly relevant, except that cutting out unproductive words will help focus the training -- you won't get nearly as many spurious classifications from incidental correlations. Do you have a way to reduce the size programmatically? Can you apply that to the new input you want to classify? If not, then I'm not sure it's worth the effort.
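To make the one-file, one-column-per-tag idea from Question 2 concrete, here is a minimal multi-label sketch. It assumes scikit-learn (the question doesn't name a library) and reuses the toy reviews from the question plus one invented record; with only three examples it won't tag anything reliably, it just shows the data format.

```python
# One training file, one true/false column per tag: MultiLabelBinarizer turns the
# tag lists into a 0/1 matrix, and OneVsRestClassifier fits one binary classifier
# per tag (functionally the same as 15 separate true/false files).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

reviews = [
    "Best pizza ever, we really loved this place, our kids ...",
    "The ATV tour was extreme and the nature was beautiful ...",
    "The hotel spa was quiet and a great place to unwind",      # invented third record
]
tags = [["food", "family"], ["active", "family", "nature"], ["relax"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)          # rows = reviews, columns = tags (5 here, 15 in practice)

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),      # drops unproductive filler words
    OneVsRestClassifier(LogisticRegression()),  # one true/false model per tag
)
clf.fit(reviews, y)

pred = clf.predict(["Great burgers and a playground for the kids"])
print(mlb.inverse_transform(pred))   # tags predicted for the new review
```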
OPEN ISSUES
What empirical evidence do you have that 80% accuracy is possible with the given data? If the training data do not contain the theoretical information needed to accurately tag that data, then you have no chance to get the model you want.
Does your chosen application have enough intelligence to break the review into words? Is there any cognizance of word order or semantics -- and do you need that?
After facing similar problems, here are my insights regarding your questions:
According to the Watson Natural Language Classifier documentation, it is best to limit the length of input text to fewer than 60 words, so I guess trimming your reviews below the 80-word average will produce better results.
You can go either way, but separate files will produce more unambiguous results.
Creating a synonym graph, as suggested, would be a good place to start; Watson is aimed at more complex cognitive solutions.
Some other helpful tips from the Watson guidelines:
Limit the length of input text to fewer than 60 words.
Limit the number of classes to several hundred classes. Support for larger numbers of classes might be included in later versions of the service.
When each text record has only one class, make sure that each class is matched with at least 5 - 10 records to provide enough training on that class.
It can be difficult to decide whether to include multiple classes for a text. Two common reasons drive multiple classes:
When the text is vague, identifying a single class is not always clear.
When experts interpret the text in different ways, multiple classes support those interpretations.
However, if many texts in your training data include multiple classes, or if some texts have more than three classes, you might need to refine the classes. For example, review whether the classes are hierarchical. If they are hierarchical, include the leaf node as the class.
I am building a local OLAP cube based on data gathered from several OLTP sources. Please note that I am doing this programmatically and do not have access to tools like SSAS or MDX-based tools.
My requirements are somewhat different than the operational requirements of the OLTP system users. I know that "in theory" it would be preferable to retain the most atomic grain available to me, but I don't see a reason to include the lowest level of data in the cube.
For example (I am simplifying), I have a measure field like "Price". Additionally, each sales fact has a Version attribute with values such as:
List (Original/Initial)
Initial Quote
Adjusted Quote
Sold
These describe the internal development of our pricing and are critical to the reports that I create.
However, for my reporting purposes, I will always want to know the value of all Versions whenever I am referencing a given transaction. Therefore, I am considering pivoting measures like Price by Version in the cube (Version will still be its own entity in the data model), resulting in measures like:
PriceList
PriceQuotedInitial
PriceQuotedAdjusted
PriceSold
Since only one Version is ever effective at a given point in time, we do not need to aggregate across multiple Versions.
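A rough sketch of that pivot, using pandas as a stand-in for whatever actually builds the local cube; the table and column names are illustrative, not the real schema.

```python
# Pivot the Price measure by Version so each transaction becomes one row with
# PriceList, PriceQuotedInitial, PriceQuotedAdjusted and PriceSold columns.
import pandas as pd

facts = pd.DataFrame({
    "TransactionID": [1, 1, 1, 1, 2, 2],
    "Version": ["List", "Initial Quote", "Adjusted Quote", "Sold", "List", "Sold"],
    "Price": [100.0, 95.0, 92.5, 90.0, 200.0, 185.0],
})

pivoted = facts.pivot_table(index="TransactionID",
                            columns="Version",
                            values="Price",
                            aggfunc="first")   # only one row per Version per transaction
pivoted = pivoted.rename(columns={
    "List": "PriceList",
    "Initial Quote": "PriceQuotedInitial",
    "Adjusted Quote": "PriceQuotedAdjusted",
    "Sold": "PriceSold",
})

# Calculated measures comparing Versions become simple column arithmetic,
# and the record count drops (here from 6 fact rows to 2 pivoted rows).
pivoted["DiscountFromList"] = pivoted["PriceList"] - pivoted["PriceSold"]
print(pivoted)
```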
Known Advantages
Since this will be a local cube file, it appears this approach would simplify the creation of several required calculated measures that compare Price across different Versions (it would not be an issue to create calculated measures at various levels of aggregation if I was doing this with MDX).
It would also reduce the number of records by a factor of between 3 and 6, which would significantly boost performance for a local cube.
Known Disadvantages
While the data model will match the business process, the cube would not store the data at the most atomic level. An analyst would need to distinguish between Versions by Measure selection, and could not filter by Version - they would always get all available Versions.
This approach will greatly increase the number of Measures. For example, there is not just one Price we are tracking, but several price components and other Measures we track for each transaction. So if we track a dozen true Measures for each transaction, that might end up being 50-60 Measures if I take this approach.
I understand that for very large Fact tables, it would be preferable to factor all possible fields out of the Fact table into Dimensions for performance purposes, but I am not sure whether this is the case when using a local cube, as in all likelihood, I will put fewer than 50,000 records into any given cube file, given the limitations of local cubes.
Are there other drawbacks to this approach that I'm missing?
Problem:
I want to suggest the top 10 most compatible matches for a particular user, by comparing his/her 'interests' with interests of all others. I'm building an undirected weighted graph between users, where the weight = match score between the two users.
I already have a set of N users: S. For any user U in S, I have a set of interests I. After a long time (a week?) I create a new user U with a set of interests and add it to S. To generate a graph for this new user, I'm comparing interest set I of the new user with the interest sets of all the users in S, iteratively. The problem is with this "all the users" part.
Let's talk about the function for comparing interests. An interest in a set of interests I is a string. I'm comparing two strings/interests using WikipediaMiner (it uses Wikipedia links to infer how closely related two strings are, e.g. Billy Jean & Thriller ==> high match, Brad Pitt & Jamaica ==> low match, and so on). I've asked a question about this too (to see if there's a better solution than the one I'm currently using).
So, the above function takes non-negligible time, and in total it will take a HUGE amount of time when we compare thousands (maybe millions?) of users and their hundreds of interests. For 100,000 users, I can't afford to make 100,000 user comparisons within a small time (<30 sec) in this way. But I have to give the top 10 recommendations within 30 secs, possibly a preliminary recommendation, and then improve on it over the next minute or so with better recommendations. Simply comparing 1 user against the N users sequentially is too slow.
Question:
Please suggest an algorithm, method or tool using which I can improve my situation or solve my problem.
I could only think of one approach to the problem, since the outcomes of the steps below depend on the nature of the inter-relation between interests.
=> Step 1: As your title says, build an undirected weighted graph with interests as vertices and the weighted match between them as edge weights.
=> Step 2: Cluster the interests (the most complex step).
K-means is a commonly used clustering algorithm, but it works on a k-dimensional vector space (see the Wikipedia article on how k-means works): it minimizes, over all clusters, the sum of squared distances from each point to the center of its cluster. In your case there are no dimensions available, so try to apply the same minimization logic by creating some kind of rule for the distance between two vertices: a higher match => a smaller distance, and vice versa (what are the different matching levels provided by WikipediaMiner?). Choose as the "mean" of a cluster the most connected vertex in the chosen set; PageRank sounds like a good option for finding the most connected vertex.
"Pair-counting F-Measure" sounds like it suit's your need (weighted graph), check for other options available.
(Note: keep modifying this step until the right clustering algorithm is found, and the right calibration for the distance rule, the number of clusters, etc. is found.)
=> Step 3: Evaluate the clusters.
From here on it is like calibrating a couple of things to fit your needs.
Examine the clusters and re-evaluate: the number of clusters, the inter-cluster distance, the distance between vertices inside clusters, the size of clusters, and the time/precision trade-off (compare the final match results against results without any clustering). Go back to Step 2 until this evaluation is satisfactory.
=> Step 4: Examine a new interest.
Iterate through all clusters, calculate the connectivity of the new interest within each cluster, and sort the clusters by connectivity; for the top x% of sorted clusters, sort and filter out the highly connected interests.
=> Step 5: Match users.
Reverse look up the set of all users from the interests obtained in Step 4, compare all interests for both users, and generate a score.
=> Step 6: Apart from the above,
you can distribute the load across multiple systems/processors (for example, multiple machines can be used for the clusters, with machine n handling cluster n), based on the traffic.
What is the application for this problem, and what's the expected traffic?
Another solution for finding the connectivity between the new interest and the set of interests in a cluster C:
Wiki-Miner runs on a set of wiki documents; let me call it the UNIVERSE.
1: For each cluster, fetch and maintain (index it; Lucene might be handy) the "set of highly relevant docs" (I'm calling it the HRDC) out of the UNIVERSE, so you have N HRDCs if you have N clusters.
2: When a new interest comes in, find "connectivity with cluster" = (hit ratio of the interest in the HRDC) / (hit ratio of the interest in the UNIVERSE) for each HRDC.
3: Sort the "connectivity with cluster" values and choose the highly connected clusters.
4: Either compare all the vertices in the cluster with the new interest, or only the highly connected vertices (using PageRank), depending on the time/precision trade-off that suits you.
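A rough sketch of the hit-ratio formula in step 2, with naive substring counting standing in for the Lucene index look-ups; the document sets and the interest term are made up.

```python
# Connectivity of a new interest with a cluster, per step 2:
#   connectivity = (hit ratio in that cluster's HRDC) / (hit ratio in the UNIVERSE)
# "Hit ratio" here means the fraction of documents mentioning the interest; a real
# Lucene index would replace these naive membership checks.
def hit_ratio(term: str, docs: list[str]) -> float:
    hits = sum(1 for d in docs if term.lower() in d.lower())
    return hits / len(docs) if docs else 0.0

def connectivity(term: str, hrdc: list[str], universe: list[str]) -> float:
    universe_ratio = hit_ratio(term, universe)
    if universe_ratio == 0:
        return 0.0
    return hit_ratio(term, hrdc) / universe_ratio

# Hypothetical data: the UNIVERSE and one cluster's highly relevant docs (HRDC).
universe = ["ATV tours in the mountains", "stock market report",
            "motorbike safety tips", "recipe for pizza"]
hrdc_active = ["ATV tours in the mountains", "motorbike safety tips"]

print(connectivity("motorbike", hrdc_active, universe))  # > 1 means above-average presence
```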
One flaw is that you're basing your algorithm's complexity on the wrong thing. The real issue is that you have to compare each unique interest against every other unique interest (and that interest against itself).
If all of the interests are unique, then there is probably nothing you can do. However, if you have a lot of duplicate interests, you can perhaps speed up the algorithm as follows.
Create a graph that associates each interest with the users who have that interest, in a way that allows for fast look-ups.
Create a graph that shows how each interest relates to every other interest, also in a way that allows for fast look-ups.
Then, when a new user is added, their interests are compared to all other interests and stored in the graph. You can then use that information to build a list of users with similar interests. That list of users will then need to be filtered somehow to bring it down to the top 10.
Finally, add that user and their interests to the graph of users and interests. This is done last so that the user with the most closely matched interests isn't the user themselves.
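A sketch of those two fast look-up structures using plain dictionaries; the similarity function is just a placeholder for the expensive WikipediaMiner call, and all names are illustrative.

```python
from collections import defaultdict

# Structure 1: interest -> set of users having that interest (fast reverse look-up).
users_by_interest: dict[str, set[str]] = defaultdict(set)

# Structure 2: cached pairwise similarity between interests, so each unique pair of
# interests is scored at most once, no matter how many users share them.
interest_similarity: dict[frozenset, float] = {}

def similarity(a: str, b: str) -> float:
    # Placeholder for the WikipediaMiner relatedness call.
    return 1.0 if a == b else 0.1

def add_user(user: str, interests: set[str], known_interests: set[str]) -> None:
    # Score the new user's interests against every interest already seen, caching results.
    for mine in interests:
        for other in known_interests:
            key = frozenset((mine, other))
            if key not in interest_similarity:
                interest_similarity[key] = similarity(mine, other)
    # Register the user only after scoring, so they are not matched against themselves.
    for mine in interests:
        users_by_interest[mine].add(user)
        known_interests.add(mine)

known: set[str] = set()
add_user("alice", {"ATV", "pizza"}, known)
add_user("bob", {"motorbike", "pizza"}, known)
print(users_by_interest["pizza"])   # {'alice', 'bob'}
```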
Note:
There might be some statistical shortcuts you could take, something like this: A is related to B, B is related to C, C is related to D; therefore A is related to B, C, and D. However, using those kinds of shortcuts likely requires a much better understanding of how your comparison function works, which is a bit beyond my expertise.
Approximate solution:
I forgot to mention it earlier, but what you're doing when comparing users or interests is a "nearest neighbor search" in high dimensions. That means that for exact solutions, a linear search generally works better than specialized data structures. So approximation is probably the best way to go if you need it faster.
To obtain a quick approximate solution (without guarantees as to how close it is), you'll need a data structure that allows for quickly being able to determine which users are likely to be similar to a new user.
One way to build that structure:
Pick 300 random users. These will be the seed users for 300 clusters. Ideally, you'd use the 300 users that are least closely related to one another, but that's probably not practical; still, it might be wise to ensure that no seed user is too closely related to the other seed users (measured as a sum or average of its comparisons to the other seeds).
The clusters are then filled by each user joining the cluster whose representative user most closely matches it.
The top ten can then be determined by picking the 10 most closely related users from that cluster.
If you ensure that the number of clusters and the number of users per cluster both stay fairly close to sqrt(number of users), then you obtain a fair approximation in O(sqrt(N)) by only checking the points within one cluster. You can improve that approximation by including users in additional clusters and checking the representative users for each cluster. The more clusters you check, the closer you get towards O(N) and an exact solution, although there's probably no way to say how close the current solution is to the exact one. Chances are you start to hit diminishing returns after checking more than log(sqrt(N)) clusters in total, which would put you at O(sqrt(N) log(sqrt(N))).
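A sketch of that seed-cluster idea, assuming each user is a dict with an "interests" set and a cheap placeholder compare() standing in for the real interest comparison; the numbers and names are illustrative.

```python
import heapq
import math
import random

def compare(user_a: dict, user_b: dict) -> float:
    # Placeholder for the expensive interest comparison (higher = more similar).
    return len(user_a["interests"] & user_b["interests"])

def build_clusters(users: list[dict], n_clusters: int | None = None):
    """Assign every user to the seed whose interests match them most closely."""
    if n_clusters is None:
        n_clusters = max(1, int(math.sqrt(len(users))))   # ~sqrt(N) clusters
    seeds = random.sample(users, n_clusters)
    clusters = {id(s): [] for s in seeds}
    for u in users:
        best = max(seeds, key=lambda s: compare(u, s))
        clusters[id(best)].append(u)
    return seeds, clusters

def top_10(new_user: dict, seeds, clusters, clusters_to_check: int = 1):
    """Approximate top-10: only examine members of the closest cluster(s)."""
    ranked_seeds = sorted(seeds, key=lambda s: compare(new_user, s), reverse=True)
    candidates = []
    for s in ranked_seeds[:clusters_to_check]:   # more clusters -> closer to exact O(N)
        candidates.extend(clusters[id(s)])
    return heapq.nlargest(10, candidates, key=lambda c: compare(new_user, c))
```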
A few thoughts ...
Not exactly a graph theory solution.
Assuming a finite set of interests: for each user, maintain a bit sequence where each bit represents whether or not the user has that interest.
For a new user, simply AND (multiply bit by bit) the bit sequence with each existing user's bit sequence and count the set bits in the result, which gives an idea of how closely their interests match.
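A sketch of the bit-sequence idea using Python integers as bit masks; the interest list and users are made up.

```python
# Each interest gets a bit position; a user's interests become one integer bit mask.
# Bitwise AND ("multiplying" the sequences) plus a popcount gives the overlap size.
interests = ["pizza", "ATV", "motorbike", "nature", "spa"]
bit_for = {name: 1 << i for i, name in enumerate(interests)}

def to_mask(user_interests: set[str]) -> int:
    mask = 0
    for name in user_interests:
        mask |= bit_for[name]
    return mask

alice = to_mask({"pizza", "ATV", "nature"})
bob   = to_mask({"pizza", "nature", "spa"})

overlap = bin(alice & bob).count("1")   # number of shared interests
print(overlap)                          # 2
```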
I have both problems and solutions to over twenty years of physics PhD qualifying exams that I would like to make more accessible, searchable, and useful.
The problems on the Quals are organized into several different categories. The first category is Undergraduate or Graduate problems. (The first day of the exam is Undergraduate, the second day is Graduate). Within those categories there are several subjects that are tested: Mechanics, Electricity & Magnetism, Statistical Mechanics, Quantum Mechanics, Mathematical Methods, and Miscellaneous. Other identifying features: Year, Season, and Problem number.
I'm specifically interested in designing a web-based database system that can store the problem and solution and all the identifying pieces of information in some way so that the following types of actions could be done.
Search and return all Electricity & Magnetism problems.
Search and return all graduate Statistical Mechanics problems.
Create a random qualifying exam — meaning a new 20-question test randomly picking 2 Undergrad mechanics problems, 2 Undergrad E&M problems, etc. from past qualifying exams (over some restricted date range).
Have the option to hide or display the solutions on results.
Any suggestions or comments on how best to do this project would be greatly appreciated!
I've written up some more details here if you're interested.
For your situation, it seems that the more important part is implementing the interface rather than the data storage. To store the data, you can use a database table or tags. Each record in the database (or tag) should have the following properties:
Year
Season
Undergraduate or Graduate
Subject: CM, EM, QM, SM, Mathematical Methods, and Miscellaneous
Problem number (is it necessary?)
Question
Answer
Search and return all Electricity & Magnetism problems.
Directly query the database and you will get an array of questions; then display some or all of them.
Create a random qualifying exam — meaning a new 20-question test randomly picking 2 Undergrad mechanics problems, 2 Undergrad E&M problems, etc. from past qualifying exams (over some restricted date range).
To generate a random exam, you should first outline the number of questions for each category and the years they are drawn from. For example, if you want 2 undergraduate E&M questions, query the database for all undergraduate E&M questions, randomly shuffle the question array, then select the first two and display them to the student. Continue with the other categories and you will get a complete random exam paper.
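A sketch of that shuffle-and-pick approach using SQLite; the table and column names mirror the property list above but are otherwise made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE problems (
        id INTEGER PRIMARY KEY,
        year INTEGER,
        season TEXT,
        level TEXT,           -- 'Undergraduate' or 'Graduate'
        subject TEXT,         -- 'CM', 'EM', 'QM', 'SM', 'Math Methods', 'Misc'
        problem_number INTEGER,
        question TEXT,
        answer TEXT
    )
""")

def pick_random(conn, level, subject, n, year_from, year_to):
    """Randomly pick n problems of one category from a restricted date range."""
    return conn.execute(
        """SELECT id, question FROM problems
           WHERE level = ? AND subject = ? AND year BETWEEN ? AND ?
           ORDER BY RANDOM() LIMIT ?""",
        (level, subject, year_from, year_to, n),
    ).fetchall()   # answers stay hidden simply by not selecting the answer column

# e.g. 2 undergraduate E&M problems drawn from the 2000-2010 exams
exam = pick_random(conn, "Undergraduate", "EM", 2, 2000, 2010)
```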
Have the option to hide or display the solutions on results.
It is your job to determine whether you want the students to see the answers. It should be controlled by only one variable.
Are "Electricity & Magnetism" and "Statistical Mechanics" mutually exclusive categoriztions, along the same dimension? Are there multiple dimensions in categories you want to search for?
If the answer is yes to both, then I would suggest you look into multidimensional data modeling. As a physicist, you've got a leg up on most people when it comes to evaluating the number of dimensions to the problem. Analyzing reality in a multidimensional way is one of the things physicists do.
Sometimes obtaining and learning an MDDB tool is overkill. Once you've looked into multidimensional modeling, you may decide you like the modeling concept, but you still want to implement using relational databases that use the SQL interface.
In that case, the next thing to look into is star schema design. Star schema is quite different from normalization as a design principle, and it doesn't offer the same advantages and limitations. But it's worth knowing in the case where the problem is really a multidimensional one.
I apologize as I don't know whether this is more of a math question that belongs on mathoverflow or if it's a computer science question that belongs here.
That said, I believe I understand the fundamental difference between data, information, and knowledge. My understanding is that information carries both data and meaning. One thing that I'm not clear on is whether information is data. Is information considered a special kind of data, or is it something completely different?
The words data, information, and knowledge are value-based concepts used to categorize, in a subjective fashion, the general "conciseness" and "usefulness" of a particular information set.
These words have no precise meaning because they are relative to the underlying purpose and methodology of information processing; in the field of information theory they have no meaning at all, because all three are the same thing: a collection of "information" (in the information-theoretic sense).
Yet they are useful, in context, to summarize the general nature of an information set as loosely explained below.
Information is obtained (or sometimes induced) from data, but it can be richer, as well as cleaner (whereby some values have been corrected) and "simpler" (whereby some irrelevant data has been removed). So in the set-theory sense, Information is not a subset of Data, but a separate set [which typically intersects somewhat with the data but can also have elements of its own].
Knowledge (sometimes called insight) is yet another level up; it is based on information and, again, is not a [set theory] subset of information. Indeed Knowledge typically doesn't have direct reference to information elements, but rather tells a "meta story" about the information / data.
The unfounded idea that along the Data -> Information -> Knowledge chain, the higher levels are subsets of the lower ones, probably stems from the fact that there is [typically] a reduction of the volume of [IT sense] information. But qualitatively this info is different, hence no real [set theory] subset relationship.
Example:
Raw stock exchange data from Wall Street is ... Data
A "sea of data"! Someone has a hard time finding what he/she needs, directly, from this data. This data may need to be normalized. For example the price info may sometimes be expressed in a text string with 1/32th of a dollar precision, in other cases prices may come as a true binary integer with 1/8 of a dollar precision. Also the field which indicate, say, the buyer ID, or seller ID may include typos, and hence point to the wrong seller/buyer. etc.
A spreadsheet made from the above is ... Information
Various processes were applied to the data:
-cleaning / correcting various values
-cross referencing (for example looking up associated codes such as adding a column to display the actual name of the individual/company next to the Buyer ID column)
-merging when duplicate records pertaining to the same event (but say from different sources) are used to corroborate each other, but are also combined in one single record.
-aggregating: for example making the sum of all transaction values for a given stock (rather than showing all the individual transactions).
All this (and then some) turned the data into Information, i.e. a body of [IT sense] Information that is easily usable, where one can quickly find some "data", such as say the Opening and Closing rate for the IBM stock on June 8th 2009.
Note that while it is more convenient to use, in part more exact/precise, and also boiled down, there is no real [IT sense] information in there which couldn't be located or computed from the original by relatively simple (if painstaking) processes.
A financial analyst's report may contain ... Knowledge
For example, the report might indicate [bogus example] that whenever the price of oil goes past a certain threshold, the value of gold starts declining, but then quickly spikes again, around the time the prices of coffee and tea stabilize. This particular insight constitutes knowledge. This knowledge may have been hidden in the data alone, all along, but only became apparent when one applied some fancy statistical analysis, and/or required the help of a human expert to find or confirm such patterns.
By the way, in the information theory sense of the word Information, "data", "information" and "knowledge" all contain [IT sense] information.
One could possibly get on the slippery slope of stating that "As we go up the chain the entropy decreases", but that is only loosely true because
entropy decrease is not directly or systematically tied to "usefulness for humans"
(a typical example is that a zipped text file has less entropy yet is no fun reading)
there is effectively a loss of information (in addition to entropy loss)
(for example, when data is aggregated, the [IT sense] information about individual records gets lost)
there is, particularly in the case of Information -> Knowledge, a change in the level of abstraction
A final point (if I haven't confused everybody yet...) is the idea that the data->info->knowledge chain is effectively relative to the intended use/purpose of the [IT-sense] Information.
ewernli, in a comment below, provides the example of the spell checker: when the focus is on English orthography, the most insightful paper from a Wall Street genius is merely a string of words, effectively "raw data", some of it in need of improvement (along the orthography purpose chain).
Similarly, a linguist using thousands of newspaper articles which typically (we can hope...) contain at least some insight/knowledge (in the general sense) may just consider these articles raw data, which will help him/her automatically create a French-German lexicon (this will be information), and as he works on the project, he may discover a systematic semantic shift in the use of common words between the two languages, and hence gain insight into the distinct cultures.
Define information and data first, very carefully.
What is information and what is data is very dependent on context. An extreme example is a picture of you at a party which you email. For you it's information, but for the ISP it's just data to be passed on.
Sometimes just adding the right context changes data to information.
So, to answer your question: No, information is not a subset of data. It could be at least the following.
A superset, when you add context
A subset, needle-in-a-haystack issue
A function of the data, e.g. in a digest
There are probably more situations.
This is how I see it...
Data is dirty and raw. You'll probably have too much of it.
... Jason ... 27 ... Denton ...
Information is the data you need, organised and meaningful.
Jason.age=27
Jason.city=Denton
Knowledge is why there are wikis, blogs: to keep track of insights and experiences. Note that these are human (and community) attributes. Except for maybe a weird science project, no computer is on Facebook telling people what it believes in.
information is an enhancement of data:
data is inert
information is actionable
note that information without data is merely an opinion ;-)
Information could be data if you had some way of representing the additional content that makes it information. A program that tries to 'understand' written text might transform the input text into a format that allows for more complex processing of the meaning of that text. This transformed format is a kind of data that represents information, when understood in the context of the overall processing system. From outside the system it appears as data, whereas inside the system it is the information that is being understood.