Algorithm for similarity (of topic) of news items

I want to determine the similarity of the content of two news items, similar to Google News but different in the sense that I want to be able to determine what the basic topics are and then determine which topics are related.
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
If you can just throw around keywords like k-nearest neighbours with a little explanation about why they work (if you can), I will do the rest of the research and tweak the algorithm. Just looking for a place to get started, since I know someone out there must have tried something similar before.

First thoughts:
toss away noise words (and, you, is, the, some, ...).
count all other words and sort by quantity.
for each word common to the two articles, add a score depending on the sum (or product, or some other formula) of the quantities.
the score represents the similarity (a rough sketch follows).
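A minimal sketch of that scoring idea (the tiny stop-word list and the choice to multiply the shared words' counts are placeholders, not a definitive design):

    from collections import Counter
    import re

    # A tiny, hypothetical noise-word list; a real one would be much longer.
    STOP_WORDS = {"and", "you", "is", "the", "some", "a", "of", "in", "to"}

    def word_counts(text):
        """Lower-case, tokenize, drop noise words, count the rest."""
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(w for w in words if w not in STOP_WORDS)

    def similarity_score(text_a, text_b):
        """Sum the products of the counts of words the two articles share."""
        counts_a, counts_b = word_counts(text_a), word_counts(text_b)
        return sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())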
It seems to me that an article primarily about Donald Rumsfeld would contain those two words quite a bit, which is why they would carry a heavy weight in the score for that article.
However, there may be an article mentioning Warren Buffett many times but Bill Gates only once, and another mentioning both Bill Gates and Microsoft many times. The correlation there would be minimal.
Based on your comment:
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
that wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).
That's where I'd start and I can see potential holes in the theory already (an article about Bill Gates would match closely with an article about Bill Clinton if their first names are mentioned a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Hillary for the other).
I'd perhaps give it a test run before trying to introduce word-proximity functionality since that's going to make it very complicated (maybe unnecessarily).
One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.

At the moment I am thinking of something like this.
Each non-noise-word is a dimension. Each article is represented by a vector where the words that don't appear are represented by zero and those that do appear get a value that is equal to the number of times they appear divided by the total words on the page. Then I can take Euclidean distance between each of the points in this space to get the similarity of any two articles.
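A rough sketch of what I mean (the stop-word list and the tokenization below are just placeholders):

    import math
    import re
    from collections import Counter

    STOP_WORDS = {"and", "you", "is", "the", "some", "a", "of", "in", "to"}  # placeholder noise words

    def tf_vector(text):
        """One dimension per non-noise word: occurrences divided by the total words on the page."""
        all_words = re.findall(r"[a-z']+", text.lower())
        total = len(all_words) or 1
        counts = Counter(w for w in all_words if w not in STOP_WORDS)
        return {w: c / total for w, c in counts.items()}

    def article_distance(vec_a, vec_b):
        """Euclidean distance between two articles; words absent from an article contribute zero."""
        dims = vec_a.keys() | vec_b.keys()
        return math.sqrt(sum((vec_a.get(d, 0.0) - vec_b.get(d, 0.0)) ** 2 for d in dims))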
The next step would be to determine clusters of the articles, and then determine a central point for each cluster. Then compute the Euclidean distance between any two clusters which gives the similarity of the topics.
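Assuming the articles have already been vectorized as above and grouped into clusters (the clustering step itself, e.g. k-means, is left out here), the central point of a cluster and the distance between two clusters could look like this:

    import math

    def centroid(vectors):
        """Average of the term-frequency vectors in one cluster."""
        center = {}
        for vec in vectors:
            for word, value in vec.items():
                center[word] = center.get(word, 0.0) + value
        return {word: total / len(vectors) for word, total in center.items()}

    def cluster_distance(cluster_a, cluster_b):
        """Euclidean distance between the centroids of two clusters; smaller means more similar topics."""
        ca, cb = centroid(cluster_a), centroid(cluster_b)
        dims = ca.keys() | cb.keys()
        return math.sqrt(sum((ca.get(d, 0.0) - cb.get(d, 0.0)) ** 2 for d in dims))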
Baaah I think by typing it out I solved my own problem. Of course only in a very high level way, I am sure when I get down to it I will find problems ... the devil is always in the detail.
But comments and improvements still highly appreciated.

Related

Using OptaPlanner to create school time tables with some tricky constraints

I'm going to use OptaPlanner to lay out time tables for a school.
We're laying out the time tables for a full semester and every week could, if necessary, be slightly different.
There are some tricky constraints to take into account:
1. Weekly schedules
The lectures in one subject should be spread out somewhat evenly over the semester.
We can't for example put 20 math lectures the first week and "be done" with math for this semester.
In fact, it's nice to have some weekly predictability:
"Science year 2 have biology on Tuesday mornings"
This constraint must not be carved in stone, however. Some weeks have to include work experience sessions, PE excursions, etc., in which case they must deviate from the other weeks.
Problem
If I create a constraint that, say, gives -1soft for not scheduling a subject at the same time as the previous week, then OptaPlanner will waste a lot of time before it "accidentally" finds a good placement for a lecture, and even if it manages to converge so that each subject is scheduled at the same time every week, it will never ever manage to move the entire series of lectures by moving them one by one. (That local optimum will never be escaped.)
2. Cross student group subjects
There's a large correlation between student groups and courses; for example, all students in Science year 2 mostly read the same courses: Chemistry for Science year 2, Biology for Science year 2, ...
The exception being language courses.
Each student can choose to study French, German or Spanish. So Spanish for year 2 is studied by a cross section of Science year 2 students, and Social Studies year 2 students, etc.
From the experience of previous (manual) scheduling, the optimal solution is almost guaranteed to schedule all language classes in the same time slots. (If French is scheduled at 9 on Thursdays, then German and Spanish can be scheduled "for free" at 9 on Thursdays.)
Problem
There are many time slots in one semester, and the chances that OptaPlanner will discover a solution where all language lectures are scheduled at the same time by randomly moving individual lectures are small.
Also, similarly to problem 1: If OptaPlanner does manage to schedule French, German and Spanish at the same time, these "blocks" will never be moved elsewhere, since they are individual lectures, and the chances that all lectures will "randomly" move to the same new slot is tiny. Even with a large Tabu history length and so on.
My thoughts so far
As for problem 1 ("Weekly predictability") I'm thinking of doing the following:
In the construction phase for the full-semester-schedule I create a reduced version of the problem, that schedules (a reduced set of lectures) into a single "template week". Let's call it a "single-week-pre-scheduling". This template week is then repeated in the construction of the initial solution of the full semester which is the "real" planning entity.
The local search steps will then only focus on inserting PE excursions etc, and adjusting the schedule for the affected weeks.
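Something along these lines, outside OptaPlanner and just to illustrate the expansion step (the Lecture fields and names are made up, not OptaPlanner API):

    from dataclasses import dataclass, replace
    from typing import List

    @dataclass(frozen=True)
    class Lecture:
        subject: str
        week: int   # 1..weeks_in_semester
        day: str    # e.g. "Tue"
        slot: int   # time slot within the day

    def expand_template_week(template: List[Lecture], weeks_in_semester: int) -> List[Lecture]:
        """Repeat a pre-scheduled template week across the whole semester to form the
        initial solution; local search then only adjusts the weeks that must deviate
        (PE excursions, work experience sessions, ...)."""
        return [
            replace(lecture, week=week)
            for week in range(1, weeks_in_semester + 1)
            for lecture in template
        ]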
As for problem 2, I'm thinking that the solution to problem 1 might solve this too. In a one-week schedule, it seems reasonable to assume that OptaPlanner will realize that language classes should be scheduled at the same time.
Regarding the local optimum established by the single-week pre-scheduling ("Biology is scheduled on Tuesday mornings"), I imagine that I could create a custom move operation that "bundles" these lectures into a single move. I have no idea how simple this is. I would really like to keep the code as simple as possible.
Questions
Are my thoughts reasonable? Is there a more clever way to approach these problems? If I have to create custom moves anyways, perhaps I don't need to construct a template-week?
Is there a way to assign hints or weights to moves? If so, I could perhaps generate moves with slightly larger weight that adjusts scheduling to adhere to predictable weeks and language scheduled in the same time slots.
A question well asked!
With regards to your first problem, I suggest you take a look at OptaWeb Employee Rostering and the concept of rotations. A rotation is "how things generally are", and Planner then has the freedom to diverge from the rotation at a penalty. Once you understand the concept of the rotation from the UI, take a look at the planning entity Shift and how the rotation is implemented with the use of the employee and rotationEmployee variables. Note that only the employee is an actual @PlanningVariable, with the rotationEmployee being fixed.
That means that you have to define your rotations manually, therefore doing the work of the solver yourself. However, since this operation is presumably only done once per semester, maybe the solution could be to have a simpler solver generate a reasonable general rotation first, and then have a second solver take it and figure out the specific necessary adjustments?
With regards to your second problem, rotations could help there too. But I'm thinking maybe some move filtering and custom moves could help OptaPlanner either move all language classes, or none? Writing efficient custom moves is not easy, and filtering stock moves is cumbersome, so I would only do it once the potential of the other options is exhausted. If you end up doing this, look for MoveIteratorFactory.
My answer is a little vague, as we do not get into the specifics of the domain model, but for the purposes of designing the overall solution, it hopefully gives enough clues.

Recommendation systems - converting transaction counts to star ratings

I'm doing some exploratory work on recommendation systems and have been reading about collaborative filtering techniques involving user-based, item-based, and SVD algorithms. I am also trying out R's recommenderlab package.
One apparent assumption in the literature is that the user data has labelled items based on a rating scale, e.g. between 1 and 5 stars. I'm looking at problems where the user data does not have ratings but rather just transactions. For example, if I want to recommend restaurants to a user, the only data I have is how often he has visited other restaurants.
How can I convert these "transaction" counts into ratings that can be used by recommendation algorithms that expect a fixed-scale rating? One approach I thought of is simple binning:
0 stars = 0-1 visits
1 star = 2-3 visits
...
5 stars = 10+ visits
However, that doesn't seem like it would work well. For example, if someone visited a restaurant only once, he may still really love it.
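For concreteness, the straight binning above might look something like this (I've assumed the elided middle bins keep the same two-visits-per-star step, which is only a guess):

    def visits_to_stars(visits: int) -> int:
        """Simple binning: 0-1 visits -> 0 stars, 2-3 -> 1 star, ..., 10+ -> 5 stars."""
        return min(5, visits // 2)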
Any help would be appreciated.
I would try different approaches. As you said, a single visit may indicate that the user still loves the restaurant, but you don't know for sure. Your goal is not to optimize for one single user but rather for all users. So for this, you can split your data into training and test data. Train on the training data with different scales and test on the test data.
The different scales may be
a binary scale (0: never visited, 1: visited). This is mostly used in online shops (bought or not). It would support your assumption about the one-time visit.
your presented scale or other ranges for the 5 stars. You can also use more than 5 stars. I would potentially not group 0-1 visits.
The approach with the best accuracy should be chosen.
Here's an idea: restaurants the user has visited zero or one times tell you nothing about what they like. Restaurants they have visited many times tell you lots. Why not just look for restaurants similar to those the customer most regularly frequents? In this way, you're using positive information (what they like) but none of the negative since you don't have access to it anyway.
If you absolutely had to infer some continuous measure, I think it would only be sensible to look at the propensity for another visit given past behaviour. This would start with the prior probability of choosing that restaurant (background frequency, or just uniform over restaurants) with a likelihood term related to the number of visits to that restaurant. In this way the more a user visits a restaurant the more likely they are to visit again.
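A minimal sketch of that idea, assuming a uniform prior over the restaurants we know about and treating the prior as a fixed number of "pseudo-visits" (both of these, and the names, are my assumptions):

    def visit_propensities(user_visits, prior_weight=5.0):
        """Smoothed estimate of how likely the user's next visit is to go to each
        restaurant: a uniform prior blended with the user's own visit counts, so the
        more a user visits a restaurant the more likely they are to visit it again.

        user_visits: {restaurant: visit_count} for one user
        prior_weight: how many pseudo-visits the prior is worth (tunable)
        """
        if not user_visits:
            return {}
        prior = 1.0 / len(user_visits)
        total = sum(user_visits.values())
        return {
            r: (count + prior_weight * prior) / (total + prior_weight)
            for r, count in user_visits.items()
        }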

How to mitigate against bandwagon effect (voting behavior) in my ranking system?

What I mean by the bandwagon effect is the following:
Already top-ranked items have a higher tendency to get voted on at all, possibly even to get upvoted.
What I am hoping to get is some concrete recommendations, at best based on your practical experience with a mathematical formula and in which situation it helped.
However, any useful pointers are more than welcome!
My ranking system
Please consider a ranking system at a website that has a reputation system and where users cast only upvotes on items and the ranking table is reset to start fresh every month.
Every user has one upvote per item within each month, and there is a reward for users who, within a certain month, upvoted an item that made it into the top ranks at the end of that month.
Users are told the following about what increases the weight of their upvote:
1)... the more reputation you have at the time of upvoting
2)... the fewer items you upvote within the current month (including the current upvote)
3)... the fewer upvotes that item already has within the current month before your own upvote
The ranking table is recalculated once a day and is visible to all.
Goal
I'd like to implement part 3) in an effort to correct the ranks of items where one cannot tell whether some users only upvoted them because of the bandwagon effect (those users might hope to gain a "tactical" advantage simply by voting for what they perceive lots of other users have already upvoted).
I also hope in this way to mitigate the possible use of sock puppets that have managed to attain some reputation but all upvote the same item or group of items.
Question
Is there a (maybe even tested?) mathematical formula that I could just apply to the time-ordered list of upvotes for each item to get a coefficient for each of those upvotes, so that their weights are corrected in a sensible fashion?
I'm thinking it's got to be some kind of logarithmic function, but I can't quite get a grip on it...
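Just to illustrate the kind of thing I'm imagining (a pure guess, not a tested formula): a coefficient that decays logarithmically with the number of upvotes the item already has that month.

    import math

    def bandwagon_coefficient(prior_upvotes: int) -> float:
        """The first upvote keeps its full weight; later ones are discounted
        logarithmically. The +1 offset and the natural log are arbitrary knobs."""
        return 1.0 / (1.0 + math.log1p(prior_upvotes))

    def corrected_weights(time_ordered_raw_weights):
        """Apply the coefficient to each upvote's weight in arrival order."""
        return [w * bandwagon_coefficient(i) for i, w in enumerate(time_ordered_raw_weights)]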
Thank you!
Edit
Zack says: "beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed"
To further clarify: what I am after is which actual mathematical approaches are worth trying out that will, in the form of a mathematical function, translate this decrease in popularity (i.e., apply coefficients to the weights, see above) in a sensible, balanced manner.
My hope is that someone has practical experience with such approaches in a situation similar to, or more general than, the one above.
Consider applying the "Indie Rock Peter Principle": beyond a certain level of popularity, additional upvotes decrease the probability that something will be displayed.
Term coined by Leonard Richardson in this paper. Indie Rock Peter is of course from Diesel Sweeties.
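A minimal sketch of what such a cap could look like (the peak position and the exact shape are my assumptions, not taken from the paper): a score that grows with upvotes up to some sweet spot and then declines, so runaway-popular items gradually yield visibility to others.

    def display_score(upvotes: int, sweet_spot: int = 100) -> float:
        """Rises until `upvotes` reaches `sweet_spot`, then falls off, so extra
        upvotes beyond that point reduce the chance of being displayed."""
        return upvotes / (1.0 + (upvotes / sweet_spot) ** 2)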
I have always disliked the bandwagon effect in voting systems, especially "most viewed" rankings in which simply clicking on a highly ranked item increases its rank. My solution to this problem, which I have never tested or seen implemented, would be to keep track of how an item was reached (and then voted for), and ignore (or greatly decrease the weight of) votes that came from any sorted-by-ranking page.

Given a collection of consumers competing for a limited resource, allocate that resource to maximize its applicability

Sorry the question title isn't very clear, this is a challenging question to ask without providing a more concrete example. Consider the following scenario:
I have a number of friends whose birthdays are coming up on dates (d1..dn), and I've managed to come up with a number of gifts I'd like to purchase them of cost (c1..cn). Unfortunately, I only have a fixed amount of money (m) that I can save per day towards purchasing these gifts. The question I'd like to ask is:
What is the ideal distribution of savings per gift (mi, where the sum of mi from 1..n == m) in order to minimize the aggregate deviance between my friends' birthdays and the dates on which I'll have saved enough money to purchase each gift?
What I'm looking for is either a solution to this problem, or a mapping to a solved problem that I can utilize to deterministically answer this question. Thanks for pondering it, and let me know if I can provide any additional clarification!
I think you've stated a form of a knapsack problem with some additional complications - the knapsack problem is NP-Complete (p 247, Garey and Johnson). The basic knapsack problem is where you have a number of objects each with a volume and a value - you want to fill a knapsack of fixed volume with the objects to maximize the value without exceeding the knapsack capacity.
Given that you have stages (days) and resources (money) and the resources change by day while you decide what purchases to make, would lead me to a dynamic programming solution technique rather than a straight optimization model.
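For reference, the basic 0/1 knapsack dynamic program looks like the sketch below; it is only the textbook building block the analogy points at, not a solution to the gift-scheduling problem itself:

    def knapsack(capacity, items):
        """Classic 0/1 knapsack DP. items is a list of (volume, value) pairs;
        returns the best total value achievable without exceeding capacity."""
        best = [0] * (capacity + 1)                    # best[v] = max value within volume budget v
        for volume, value in items:
            for v in range(capacity, volume - 1, -1):  # iterate downwards so each item is used at most once
                best[v] = max(best[v], best[v - volume] + value)
        return best[capacity]

    # Example: knapsack(10, [(5, 10), (4, 40), (6, 30), (3, 50)]) -> 90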
Could you clarify in comments "minimizing the deviance"? I'm not sure I understand that part.
BTW, mathoverflow.com is probably not helpful for this. If you look at algorithm questions, 50 on stackoverflow and 50 on mathoverflow, you'll find the questions (and answers) on stackoverflow have a lot more in common with the problem you are considering. There is a new site called OR Exchange, but there's not a lot of traffic there yet.

Determining the popularity of a video with ratings and views

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the top of the charts (just because they're so bad).
If I go the route of a scoring system based on the number of "like" and "dislike" votes (e.g. 100 like votes and 50 dislike votes equals a score of 2), videos with few views could appear at the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What are your thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per-se.
You seem to be missing the point that likes and dislikes of movies are anything but objective, even within the context of a relatively homogeneous group of "voters". Think of how the term "Chix Flix" or the success story called "NetFlix" illustrates this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem with dealing with this extra value is its ambiguity: do people not vote because they didn't see the movie, or because they neither truly liked nor disliked it? Very likely a bit of both; therefore we can/should use the count of "page views without a vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (lest the "polarizing" movies appear more notorious or popular).
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts are visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even whether they abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (e.g. "likes"/"total" or "likes"/"dislikes", etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereas the number of votes (and of views) is indicative of the notoriety ("name recognition", etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce a lot of volatility into the rating. Phrased otherwise, small samples make for not very statistically representative ratings.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/views happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but may also be used to direct users towards currently hot items. BTW, this feeds the bandwagon effect mentioned above :-( but it also increases the voting sample size :-).
All these considerations suggest caution in implementing this rating system. They also hint at the likely need to include statistics about the complete set of movies in the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of its own vote/view counts, but also on, say, the average vote count a movie receives, the maximum views a movie page gets, etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and then the ranking is recalculated using the statistics of groups of similarly rated movies, may provide a better system (provided the formulas are "fair" and somehow converge).
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes that gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of the baseline values will influence the rating of a new movie (the two values don't have to be equal), and how many votes are needed to change the rating substantially.
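A minimal sketch of that baseline trick, using the like/dislike ratio from the question (the 10/10 baseline matches the example above; any pair of values could be substituted):

    def baseline_score(likes: int, dislikes: int, prior_likes: int = 10, prior_dislikes: int = 10) -> float:
        """Like/dislike ratio with a neutral baseline mixed in: a brand-new video
        scores (0 + 10) / (0 + 10) = 1, and the baseline's influence fades as real
        votes accumulate."""
        return (likes + prior_likes) / (dislikes + prior_dislikes)

    # baseline_score(100, 50) -> 110 / 60 ≈ 1.83, close to the raw ratio of 2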

Resources