Choosing a similarity metric for user-scores of television shows - similarity

I have a database of user ratings of various television shows on a 1-10 scale. I've been trying to find a good way of determining how similar two users' score lists are to one another for shared shows.
It feels like the most obvious way to do this is just to take the absolute value of the difference. And then sum/average that for all shared shows. But I was reading this does not take into account how users will rate things on different scales. I saw some people saying cosine similarity is better for this sort of thing. Unfortunately, I've run into a lot of cases where that metric doesn't really make sense.
Example:
overall average of user1 = 8.1
overall average of user2 = 5.8
scores for shared shows only:
S1 = [8,8,10,10,10,10,6,8,10,5,6,10]
S2 = [5,6,7,8,9,9,4,5,9,1,2,8]
Obviously, these two people rated the shows they watched pretty differently. When I use the average difference it says they are not very similar (2.3 where 0 is the same). When I use something like the cosine similarity it says they are extremely similar (0.97 where 1 is the same).
Is there a metric that would be better suited for this kind of thing? My ultimate goal is to recommend users shows from other users that have similar tastes to them.
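To make the comparison concrete, here is a minimal sketch that computes the two metrics from the question on the example lists, plus a mean-centered cosine. The mean-centered variant is not from the original post; it is one commonly suggested adjustment for users who rate on different scales (when each list is centered on its own shared-show mean it is the Pearson correlation). The function names and the optional `mean_a`/`mean_b` parameters are assumptions made for illustration.

```python
from math import sqrt

def mean_abs_diff(a, b):
    """Average absolute difference between paired scores (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    """Cosine similarity of two equal-length score lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centered_cosine(a, b, mean_a=None, mean_b=None):
    """Cosine similarity after subtracting each user's mean score.

    By default each list is centered on its own mean over the shared shows
    (Pearson correlation). Passing a user's overall average instead is an
    alternative choice, shown here only as an option.
    """
    if mean_a is None:
        mean_a = sum(a) / len(a)
    if mean_b is None:
        mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

S1 = [8, 8, 10, 10, 10, 10, 6, 8, 10, 5, 6, 10]
S2 = [5, 6, 7, 8, 9, 9, 4, 5, 9, 1, 2, 8]

print(round(mean_abs_diff(S1, S2), 2))
print(round(cosine(S1, S2), 2))
print(round(centered_cosine(S1, S2), 2))
```

On these lists the centered value also comes out high (about 0.96), which suggests the two users order the shared shows in much the same way even though user2 scores everything lower. Whether that should count as "similar taste" is a design decision for the recommender rather than something the metric settles.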

Related

Check if any combination of binary variables is correlated/has impact on an ordinal dependent variable

I am working on a case to finish my (not so advanced) data scientist course and I have already been helped a lot by topics here, thanks!
Unfortunately now I am stuck again and cannot find an existing answer.
My data comes from a bike shop and I want to see if products bought during customers' first registered purchase are related to/have an impact on how important they will become to the shop in the future. I have grouped customers into 5 clusters (from those who registered and never made another registered purchase, through those who made 2-3 purchases for little money and those who made a few purchases for a lot of money, to those who purchase stuff regularly and really bring a lot of money to this bike shop), and I have ordered them into an ordinal dependent variable.
As the independent variables I have prepared 20+ binary variables that identify products/services bought during the first purchase from this shop (first purchase as a registered customer). One row per customer. So I want to check the idea if there are combinations of products (probably "extras" to the bike purchase) that can increase the chance that a customer would register and hopefully stay as a loyal customer for the future.
The dream would be to be able to say, for example, that if you buy a cheap or middle-cheap bike during this first purchase you probably don't contribute much to the bike shop in the long term, so you have a low grade on the dependent variable. But those who bought a middle-cheap bike AND a helmet AND a lock (probably at a special price) are more likely to become one of the loyal registered customers bringing in money for a longer time.
There might be no relation like that but I want to test that anyways. Implementation of the result could be being able to recommend an extra product during a purchase (with a good price on it).
I am learning R during this course. We went through some techniques and first I was imagining it would be possible to work with the neural networks (just cause it sounded most fun to try), having all these products as input in the sparse matrix and the customers clusters as the output (I hoped it was similar to the examples I read about with sparse matrix with pixels from a picture as the input and numbers 1-9 as the output) but then I was told that this actually is based on pictures and real patterns and in my case I don't even know if there is any.
Then I was thinking I could try with the ordinal forest. But it doesn't predict my clusters well, not at all (2 out of 5 clusters get no predictions). But that is OK, I don't expect the first purchase to be able to predict all the customers future. But I would really want to see if there are combinations of products that might increase the chance that a customer ends up in one of the "higher" clusters on the loyalty scale.
I am not sure if this was clear enough. :) Do you think that there is any way of testing my idea? What could I try to do? Let me know if you need more information.

Recommendation systems - converting transaction counts to star ratings

I'm doing some exploratory work on recommendation systems and have been reading about collaborative filtering techniques involving user-based, item-based, and SVD algorithms. I am also trying out R's recommenderlab package.
One apparent assumption in the literature is that the user data has labelled items based on a rating scale, e.g. between 1 and 5 stars. I'm looking at problems where the user data does not have ratings but rather just transactions. For example, if I want to recommend restaurants to a user, the only data I have is how often he has visited other restaurants.
How can I convert these "transaction" counts into ratings that can be used by recommendation algorithms that expect a fixed-scale rating? One approach I thought of is simple binning:
0 stars = 0-1 visits
1 star = 2-3 visits
...
5 stars = 10+ visits
However, that doesn't seem like it would work well. For example, if someone visited a restaurant only once, he may still really love it.
Any help would be appreciated.
I would try different approaches. As you said, a single visit may still mean that the user loves the restaurant, but you don't know for sure. Your goal is not to optimize for one single user but rather for all users. So for this, you can split your data into training and test data. Train on the training data with different scales and test on the test data.
The different scales may be
a binary scale (0: never visited, 1: visited). This is mostly used in online shops (bought or not). It would support your assumption about the one-time visit.
your presented scale or other ranges for the 5 stars. You can also use more than 5 stars. I would potentially not group 0-1 visits.
The approach with the best accuracy should be chosen.
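The two scales mentioned above can be sketched as small conversion functions. This is only an illustration: the function names are made up here, and the bin boundaries are passed in as a parameter because the question leaves the middle bins unspecified. In `EXAMPLE_BOUNDS` only the 2-visit and 10-visit thresholds come from the question; the values in between are assumptions.

```python
from bisect import bisect_right

def to_binary(visits):
    """Binary scale: 0 = never visited, 1 = visited at least once."""
    return 1 if visits > 0 else 0

def to_stars(visits, lower_bounds):
    """Binned scale.

    lower_bounds[i] is the smallest visit count that earns i + 1 stars,
    so the result ranges from 0 to len(lower_bounds).
    """
    return bisect_right(lower_bounds, visits)

# Hypothetical boundaries: 2 and 10 follow the question's example,
# while 4, 6 and 8 are filled in purely for illustration.
EXAMPLE_BOUNDS = [2, 4, 6, 8, 10]

for visits in (0, 1, 3, 12):
    print(visits, to_binary(visits), to_stars(visits, EXAMPLE_BOUNDS))
```

Each candidate scale would then be fed to the same recommender and compared on the held-out test data.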
Here's an idea: restaurants the user has visited zero or one times tell you nothing about what they like. Restaurants they have visited many times tell you lots. Why not just look for restaurants similar to those the customer most regularly frequents? In this way, you're using positive information (what they like) but none of the negative since you don't have access to it anyway.
If you absolutely had to infer some continuous measure, I think it would only be sensible to look at the propensity for another visit given past behaviour. This would start with the prior probability of choosing that restaurant (background frequency, or just uniform over restaurants) with a likelihood term related to the number of visits to that restaurant. In this way the more a user visits a restaurant the more likely they are to visit again.
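One possible reading of the propensity idea above is a smoothed visit frequency: the prior contributes a few "pseudo-visits" spread over all restaurants, and the observed counts are added on top. This is an assumption about how to realize the answer, not something it spells out; `revisit_propensity` and `prior_strength` are hypothetical names.

```python
def revisit_propensity(visit_counts, prior=None, prior_strength=1.0):
    """Estimate the probability that the next visit goes to each restaurant.

    visit_counts:   dict mapping restaurant -> number of past visits
    prior:          dict mapping restaurant -> prior probability
                    (uniform over the listed restaurants if omitted)
    prior_strength: how many pseudo-visits the prior is worth (assumed knob)
    """
    if not visit_counts:
        return {}
    if prior is None:
        uniform = 1.0 / len(visit_counts)
        prior = {r: uniform for r in visit_counts}
    denom = sum(visit_counts.values()) + prior_strength
    return {
        r: (n + prior_strength * prior[r]) / denom
        for r, n in visit_counts.items()
    }

example = {"a": 0, "b": 1, "c": 9}  # made-up counts for illustration
print(revisit_propensity(example))
```

With this form, a restaurant that was never visited keeps a small non-zero score from the prior, and the score grows with each additional visit.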

Game story where a lower score is better

My company is producing a racing game where the best score is the fastest time. Facebook publishes the time as a regular point score, where a higher score is better. This of course is turning it all upside down.
Is there a way to control how a game's score is shown in a story? Ideally we would like to show "seconds" instead of points as well.
No, the Scores API currently only supports 'higher is better' for scores.
If you can't rework your scoring scheme to take this into account, consider using Open Graph actions instead. You can have the aggregations which appear on a user's Timeline ordered by whichever field of the object and action you need them to be ordered by.

Rating system that considers time and activity

I'm looking for a rating system that weights the rating not only on the number of votes, but also on time and "activity".
To clarify a bit:
Consider a site where users produce something, like a picture.
There is another type of user that can vote on other people's pictures (on a scale of 1-5), but one picture will only receive one vote.
The rating a productive user gets is derived from the ratings his/her pictures have received, but should be affected by:
How long ago the picture was made
How productive the user has been
A user who's getting 3's and 4's and still making 10 pictures per week should get a higher rating than a person who has gotten 5's but only made 1 picture per week and stopped a few months ago.
I've been looking at Bayesian estimate, but that only considers the total amount of votes independent of time or productivity.
My math fu is pretty strong, so all I need is a nudge in right direction and I can probably modify something to fit my needs.
There are many things you could do here.
The obvious approach is to have your measure of the scores decay with time in your internal calculations, for example using an exponential decay with a time constant T. That is, use value = initial_score*exp(-t/T), where t is the time that has passed since the picture was submitted. So if T is one month, after one month this score will contribute 1/e, or about 0.37, of what it originally did. (You can also do this differentially, btw, with value -= (dt/T)*value, if that's more convenient.)
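The decay formula above can be sketched as follows. The `decayed_score` and `decay_step` functions mirror the two forms given in the answer. Summing the decayed scores into a single user rating is an assumption added here (it is one simple way to reward productivity as well as recency), and the example users and T = 30 days are made up.

```python
from math import exp

def decayed_score(initial_score, t, T):
    """value = initial_score * exp(-t / T), with t and T in the same unit."""
    return initial_score * exp(-t / T)

def decay_step(value, dt, T):
    """Differential form: one small update, value -= (dt / T) * value."""
    return value - (dt / T) * value

def user_rating(pictures, T):
    """Assumed aggregation: sum of decayed scores over (score, age) pairs."""
    return sum(decayed_score(score, age, T) for score, age in pictures)

T = 30.0  # time constant in days (illustrative)

# Active user: 10 pictures per week over the last 4 weeks, scores of 3 and 4.
active = [(3 + (i % 2), 7 * week) for week in range(4) for i in range(10)]
# Inactive user: 1 picture per week for 8 weeks, all 5's, stopped 90 days ago.
inactive = [(5, 90 + 7 * week) for week in range(8)]

print(user_rating(active, T), user_rating(inactive, T))
```

Under these assumptions the active user ends up with the higher rating, which matches the behaviour asked for in the question. Averaging instead of summing would remove the productivity effect, so the choice of aggregation matters.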
There's probably a way to work this with a Bayesian approach, but it seems forced to me. Bayesian approaches are generally about predicting something new based on a (usually large) set of prior data, which doesn't directly match your model.

Determining the popularity of a video with ratings and views

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the top of the charts (just because they're so bad).
If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What's your guys' thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per-se.
You seem to be missing the point that likes and dislikes in movies are anything but objective even within the context of a relatively homogeneous group of "voters". Think how the term "Chix Flix" or the success story called "NetFlix", illustrate this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem of dealing with this extra value is its ambiguity: do people not vote because they didn't see the movie or because they neither truly liked nor disliked it? Very likely a bit of both, therefore we can/should use the count of the "Page views without vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (lest the "polarizing" movies appear more notorious or popular).
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts are visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even decide to abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (e.g. "likes"/"total" or "likes"/"dislikes" etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereas the number of votes (and of views) is indicative of the notoriety ("name recognition" etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce much volatility in the rating. Phrased otherwise, small samples make for ratings that are not very statistically representative.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/view happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but also may be used to direct the users towards currently hot items. BTW, hence feeding the bandwagon effect mentioned :-( but also, increasing the voting sample size :-).
All these considerations suggest caution in implementing this rating system. They also hint at the likely need to include statistics about the complete set of movies in the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of its own vote/view counts but also on, say, the average vote count a movie receives, the maximum number of views a movie page gets, etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and then the ranking is recalculated by using the statistics of groups of similarly rated movies, may provide a better system (provided the formulas are "fair" and somehow converge).
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes that gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of the baseline values will influence the rating of a new movie (the two values don't have to be equal), and how many votes are needed to change the rating substantially.
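The baseline trick above can be sketched in a few lines. The function name and the default baseline of 10 likes and 10 dislikes (taken from the example in the answer) are illustrative; the right baseline values would need tuning for a real site.

```python
def baseline_score(likes, dislikes, base_likes=10, base_dislikes=10):
    """Like/dislike ratio with a neutral baseline added to both counts."""
    return (likes + base_likes) / (dislikes + base_dislikes)

print(baseline_score(0, 0))       # brand-new video starts at the neutral score
print(baseline_score(2, 0))       # a couple of early likes barely move it
print(baseline_score(100, 50))    # many votes: approaches the raw ratio of 2
print(baseline_score(1000, 500))  # the baseline is overwhelmed
```

A video with only a handful of votes therefore stays close to the neutral score of 1 and cannot jump to the top of the chart on two or three likes alone.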

Resources