How to scale no. of page hits? - math

I'll start with describing my problem.
I have n pages each with its own popularity factor. Popularity factor is on a scale of 10. Now, I have total page hits for each of the pages with me and I want to use those total page hits for calculating the popularity factor again on a scale of 10.
The total page hits is an absolute number and I have these values for only 1,70,000 pages. The total pages which I have with me is 41,00,000.
Now, my problem is I don't know how to normalize these total page hits for all of the total pages.
I tried doing this:
Popularity factor for each page = Total page hits for all the pages/total no. of pages.
I'll assume that the pages with no data will be having at least 1 total page hits. But that way my denominator becomes really big number and in the process of scaling on a scale of 10, I'm lost.
Can anyone please help with how can I approach it ?

There are several ways to do it. Here are some examples:
Absolute popularity
Find the number of hits of the most popular page.
Assign a popularity score bases on the number of hits compared to the most popular page:
0-10% = popularity 1, 10-20% = popularity 2 and so on.
Relative popularity
Sort all pages according to number of page hits.
Assign a popularity score based on the position in the list:
0-10% = popularity 1, 10-20% = popularity 2 and so on.
Popularity of pages without statistics
I can't give you any advice on how to handle these. If you don't know how many times a page has been accessed it is really hard to give it a popularity score.

Related

Choosing a similarity metric for user-scores of television shows

I have a database of user ratings of various television shows on a 1-10 scale. I've been trying to find a good way of determining how similar two user score-lists are two one another for shared shows.
It feels like the most obvious way to do this is just to take the absolute value of the difference. And then sum/average that for all shared shows. But I was reading this does not take into account how users will rate things on different scales. I saw some people saying cosine similarity is better for this sort of thing. Unfortunately, I've run into a lot of cases where that metric doesn't really make sense.
Example:
overall average of user1 = 8.1
overall average of user2 = 5.8
scores for shared shows only:
S1 = [8,8,10,10,10,10,6,8,10,5,6,10]
S2 = [5,6,7,8,9,9,4,5,9,1,2,8]
Obviously, these two people rated the shows they watched pretty differently. When I use the average difference it says they are not very similar (2.3 where 0 is the same). When I use something like the cosine similarity it says they are extremely similar (0.97 where 1 is the same).
Is there a metric that would be better suited for this kind of thing? My ultimate goal is to recommend users shows from other users that have similar tastes to them.

WordPress Popularity Score relative to all posts (score views, shares and comments)

I am trying to find a solution to score my posts in an effort to visually represent how "hot" they are, relative to other posts, or relative to a score of 1.
I have views, total shares and comments available as inputs, but I'm unsure what approach to take to score my posts against each other, so I have a relative score out of 1, i.e. .6, .8 etc.
I came across the following in my search for a solution
http://www.evanmiller.org/rank-hotness-with-newtons-law-of-cooling.html
and
Hot content algorithm / score with time decay
Any ideas on how I can approach this? The posts should also take into account time, as I have several years worth of posts.
Thank you
I ended up setting 5 scoring periods, then scoring each post against the highest score found in that period. The score takes in social media, post views and comments. Provides a good rating!

Ratingsystem that considers time and activity

I'm looking for a rating system that does not only weight the rating on number of votes, but also time and "activity"
To clarify a bit:
Consider a site where users produce something, like a picture.
There is another type of user that can vote on other peoples pictures (on a scale 1-5), but one picture will only recieve one vote.
The rating a productive user gets is derived from the rating his/hers pictures have recieved, but should be affected by:
How long ago the picture was made
How productive the user has been
A user who's getting 3's and 4's and still making 10 pictures per week should get higher rating than a person that have gotten 5's but only made 1 pic per week and stopped a few month ago.
I've been looking at Bayesian estimate, but that only considers the total amount of votes independent of time or productivity.
My math fu is pretty strong, so all I need is a nudge in right direction and I can probably modify something to fit my needs.
There are many things you could do here.
The obvious approach is to have your measure of the scores decay with time in your internal calculations, for example using an exponential decay with a time constant T. For example, use value = initial_score*exp(-t/T) where t is the time that's passed since picture was submitted. So if T is one month, after one month this score will contribute 1/e, or about 0.37 that it originally did. (You can also do this differentially, btw, with value -= (dt/T)*value, if that's more convenient.)
There's probably a way to work this with a Bayesian approach, but it seems forced to me. Bayesian approaches are generally about predicting something new based on a (usually large) set of prior data, which doesn't directly match your model.

How can I determine the "correct" number of steps in a questionnaire where branching is used?

I have a potential maths / formula / algorithm question which I would like help on.
I've written a questionnaire application in ASP.Net that takes people through a series of pages. I've introduced conditional processing, or branching, so that some pages can be skipped dependent on an answer, e.g. if you're over a certain age, you will skip the page that has Teen Music Choice and go straight to the Golden Oldies page.
I wish to display how far along the questionnaire someone is (as a percentage). Let's say I have 10 pages to go through and my first answer takes me straight to page 9. Technically, I'm now 90% of the way through the questionnaire, but the user can think of themselves as being on page 2 of 3: the start page (with the branching question), page 9, then the end page (page 10).
How can I show that I'm on 66% and not 90% when I'm on page 9 of 10?
For further information, each Page can have a number of questions on that can have one or more conditions on them that will send the user to another page. By default, the next page will be the next one in the collection, but that can be over-ridden (e.g. entire sets of pages can be skipped).
Any thoughts? :-s
Simple answer is you can't. From what you have said you won't know until you have reached the end how many pages the user is going to see so you can't actually display an accurate result.
What you could do to get a better result as in your example is to assume they are going through all the remaining pages. In this case you would on any page have:
Number of pages gone through so far including current (visited_pages)
Number of the current page (page_position)
Total number of pages (total_pages)
The maximum number of pages is now:
max_pages = total_pages - page_position + visited_pages
You can think of total_pages-page_position as being the number of pages left to visit which makes the max_pages quite intuitive.
So in the 10 page example you gave visited_pages = 2 (page 1 and page 9), page_position = 9 and total_pages = 10.
so max_pages = 10-9+2 = 3.
then to work out the distance through you just do
progress = visited_pages/max_pages*100
One thing to note is that if you were to go to pages 1,2,3,4,9,10 then your progress would be 10%,20%,30%,40%,83%,100% so you would still get a strange jump that may confuse people. This is pretty much inevitable though.
Logically unless the user's question graph is predetermined, you can not predetermine their percentage complete.
That being said.
What you can do is build a full graph of the users expected path based on what information you know: what questions they have completed, and what questions they have to take, and simply calculate a percentage.
This will probably involve a data structure such as a linked list to calculate where they have been, and what the user has left to complete, this implementation is up to you.
The major caveat here is you have to accept that if the users question graph changes, so will their percent completed. Theoretically they could display 90% and then it changes and they will display 50%.

Determining the popularity of a video with ratings and views

I am about to embark on a new project - a video website. Users will be able to register, and vote on videos by clicking "like" or "dislike", or something to that effect. In any event, it will be a 2-option voting system, not a 5-star system.
Every X number of days, I will be generating a "chart" of the most popular videos. So my question is: how should I determine the popularity of a given video?
If I went the route of tallying up the videos with the most views, this could have the effect of exceptionally bad videos making it to the of the charts (just because they're so bad).
If I go the route of a scoring system based on the amount of "like" and "dislike" votes (eg. 100 like votes, and 50 dislike votes equals a score of 2), videos with few views could appear on the top of the charts.
So, what I need to do is a combination of the two. Barring, of course, spammy views and votes.
What's your guys' thoughts on the subject?
Edit: the following tags were removed: [mysql] [postgresql], to make room for other, more representative tags; the SQL technology used in the intended implementation does not seem to bear much on the considerations regarding the rating model per-se.
You seem to be missing the point that likes and dislikes in movies are anything but objective even within the context of a relatively homogeneous group of "voters". Think how the term "Chix Flix" or the success story called "NetFlix", illustrate this subjectivity...
Yet, if you persist in implementing the model you suggest, there are several hidden variables and system dynamics that need to be acknowledged and possibly taken into account in the rating's formula.
the existence of a third, implicit, value of the vote: "No vote"
i.e. when someone views the movie page and yet doesn't vote, either way.
The problem of dealing with this extra value is its ambiguity: do people not vote because they didn't see the movie or because they neither truly like nor disliked it? Very likely a bit of both, therefore we can/should use the count of the "Page views without vote" in the formula, to boost (somewhat) the rating of movies that do not generate a strong (positive or negative) sentiment (lest the "polarizing" movies will appear more notorious or popular)
the bandwagon effect
Past a certain threshold, and particularly if the rating and/or vote counts is visible before the page view, the rating and vote counts can influence the way people decide to vote (either way) or even decide to abstain from voting. The implication is that the total vote and/or view counts do not relate linearly to the effective rating.
"quality" vs. "notoriety"
Vote ratios in general (eg "likes" / "total" or "likes"/"dislikes" etc.) are indicative of the "quality" of a movie (note the quotes around quality...), whereby the number of votes (and of views) is indicative of the notoriety ("name recognition" etc.) of a movie.
statistical representativity
Very small vote and/or view counts are to be handled carefully because they introduce much volatility in the rating. Phrased otherwise, small samples make for not so statically representative ratings.
trends (the time variable)
At the risk of complicating the model, consider keeping [some] record of when votes/view happened, to allow identifying "hot" (and "cooling") movies in the collection. This info may inform the rating logic, but also may be used to direct the users towards currently hot items. BTW, hence feeding the bandwagon effect mentioned :-( but also, increasing the voting sample size :-).
All these considerations suggest caution in implementing this rating system. It also hints at the likely need of including statistics about the complete set of movies into the rating formula for an individual movie. In other words, do not rate a given movie solely on the basis of the its own vote/view counts but also on say the average vote counts a move receives, the maximum view a movie page gets etc. In fact, an iterative process, whereby movies are [roughly] ranked at first and then the ranking is recalculated by using the statistics of groups of movies similarly rated may provide a better system (provided the formulas are "fair" and somehow converge)
A standard trick is to start with a neutral baseline: say 10 likes and 10 dislikes that gives a score of 1. The first few votes don't change the ratio too much, but as votes accumulate, the baseline is overwhelmed. The exact choice of the baseline values will influence the rating of a new movie (the two values don't have to be equal), and how many votes are needed to change the rating substantially.

Resources