Voting - Number of votes vs Vote percent? - math

I've implemented a simple up/down voting system on a website, and I keep track of individual votes as well as vote time and a unique user ID (hashed IP).
My question is not how to calculate the percent or sum of the votes - but more, what is a good algorithm for determining a good score based on votes?
I find sorting by pure vote percent to be unacceptable, as well as simply tallying upvotes.
Consider this example:
Image A: 4 upvotes, 1 downvote
Image B: 5 upvotes, 4 downvotes
Image C: 1 upvote, 0 downvotes
The ideal system would put A first, maybe followed by B and then C.
In a pure percentage scenario, the order is C > A > B. (wrong)
In a pure vote count scenario, the order is B > A > C. (wrong)
I have an idea for a somewhat "hybrid" algorithm based on the system's confidence in a score, maybe something along the lines of:
// (if totalvotes > 0, else score = 0)
score = 1 - (((downvotes + 1) / (totalvotes + 1)) * sqrt(1 / totalvotes))
However, I was hoping to ask the community if there are any really well-defined algorithms already out there that I simply don't know about, before I sit around tweaking my algorithm from now until sunset.
I also have date data for each vote - however, the content of the site isn't very time-sensitive so I don't really care to sort by "what's hot" at all.

Sorting by the average of votes is not very good.
Instead, balance the proportion of positive ratings against the uncertainty of a small number of observations, as explained in the article below; this gives a much better representation of your scores.
The article explains how to avoid the mistake that many popular websites (Amazon, Urban Dictionary, etc.) make.
http://evanmiller.org/how-not-to-sort-by-average-rating.html
Hope this helps!
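The linked article recommends ranking by the lower bound of the Wilson score confidence interval. Below is a minimal Python sketch of that idea applied to the three images from the question. The function name, the choice of z = 1.96 (roughly 95% confidence), and the decision to score unrated items as 0 are assumptions made for illustration, not something the question or the answer specifies.

```python
import math

def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the upvote proportion."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0  # assumption: items with no votes sort last
    p = upvotes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)

# (upvotes, downvotes) for the images in the question
images = {"A": (4, 1), "B": (5, 4), "C": (1, 0)}
ranking = sorted(images, key=lambda name: wilson_lower_bound(*images[name]), reverse=True)
print(ranking)  # ['A', 'B', 'C']
```

With these inputs the order comes out as A, B, C, which is the order the question describes as ideal.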

I know this doesn't answer your question directly, but I spent 3 minutes, just for fun, trying to find a formula. Just check it :) Column A is upvotes and column B is downvotes :)
=(LN((A1+1)/(A1+B1+1))+1)*LN(A1)
A (upvotes)  B (downvotes)  score
5            3              0.956866995
4            1              1.133543015
5            4              0.787295787
1            0              0
6            4              0.981910844
2            8              -0.207447157
6            5              0.826007385
3            3              0.483811507
4            0              1.386294361
5            0              1.609437912
6            1              1.552503332
5            2              1.146431478
100          100            -3.020151034
10           10             0.813671022
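For anyone who wants to try the spreadsheet formula outside Excel, here is a rough Python translation. The function name is made up for this sketch, and the explicit error for zero upvotes is an assumption: LN(A1) is undefined when A1 is 0, so the formula as written cannot score an item with no upvotes.

```python
import math

def log_score(upvotes: int, downvotes: int) -> float:
    """Mirrors the spreadsheet formula =(LN((A1+1)/(A1+B1+1))+1)*LN(A1)."""
    if upvotes <= 0:
        # LN(0) has no value, so the formula is undefined without upvotes
        raise ValueError("formula is undefined for upvotes <= 0")
    ratio = (upvotes + 1) / (upvotes + downvotes + 1)
    return (math.log(ratio) + 1) * math.log(upvotes)

# The three images from the question: A, B, C
for up, down in [(4, 1), (5, 4), (1, 0)]:
    print(up, down, round(log_score(up, down), 9))
```

For the three images in the question this also gives the order A, B, C.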

Related

R: optimal sorting/allocation/distribution of items

I'm hoping someone may be able to help with a problem I have - trying to solve using R.
Individuals can submit requests for items. The minimum number of requests per person is one. There is a recommended maximum of five, but people can submit more in exceptional circumstances. Each item can only be allocated to one individual.
Each item has a 'desirability'/quality score ranging from 10 (high quality) down to 0 (low quality). The idea is to allocate items, in line with requests, such that as many high quality items as possible are allocated. It is less important that individuals have an equitable spread of requests met.
Everyone has to have at least one request met. Next priority is to look at whether we can get anyone who is over the recommended limit within it by allocating requests to others. After that the priority is to look at where the item would rank in each individual's request list based on quality score, and allocate to the person where it would rank highest (eg, if it would be first in someone's list and third in another's, give it to the former).
Effectively I'd need a sorting algorithm of some kind that:
Identifies where an item has been requested more than once.
Checks all the requests of everyone making said request.
If that request is the only one a person has made, gives it to them (if this scenario applies to more than one person, it should be flagged in some way).
If all requesters have made more than one request, checks whether any have made more than five requests; if they have, it can be taken off them.
If all are within the recommended limit, sees where the request would rank (based on quality score) and gives it to the person in whose list it would rank highest.
The process needs to check that the above step isn't happening to people so many times that it leaves them without any requests... so it effectively has to check one item at a time.
Does anyone have any ideas about how to approach this? I can think of all kinds of ways I could arrange the data to make it easy to identify and see where this needs to happen, but not how to automate the process itself. Thanks in advance for any help.
The data (at least the bits needed for this process) looks like the below:
Item ID Person ID Item Score
1 AAG 9
1 AAK 8
2 AAAX 8
2 AN 8
2 AAAK 8
3 Z 8
3 K 8
4 AAC 7
4 AR 5
5 W 10
5 V 9
6 AAAM 7
6 AAAL 7
7 AAAAN 5
7 AAAAO 5
8 AB 9
8 D 9
9 AAAAK 6
9 AAAAC 6
10 A 3
10 AY 3
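As a starting point only, the first three checks in the list above (find contested items, look at each requester's full request list, and give the item to a sole-request person or flag a clash) could be sketched like this. It is written in Python rather than R, uses only the first seven rows of the sample data, and all variable names and the "allocate" / "flag" / "needs_further_rules" labels are hypothetical. The five-request limit and the ranking rule are deliberately left out.

```python
from collections import defaultdict

# First rows of the sample data: (item_id, person_id, item_score)
requests = [
    (1, "AAG", 9), (1, "AAK", 8),
    (2, "AAAX", 8), (2, "AN", 8), (2, "AAAK", 8),
    (3, "Z", 8), (3, "K", 8),
]

requesters_by_item = defaultdict(list)
items_by_person = defaultdict(list)
for item_id, person_id, _score in requests:
    requesters_by_item[item_id].append(person_id)
    items_by_person[person_id].append(item_id)

# Check 1: items requested more than once
contested = {item: people for item, people in requesters_by_item.items() if len(people) > 1}

# Checks 2-3: people for whom this item is their only request
decisions = {}
for item, people in contested.items():
    sole = [p for p in people if len(items_by_person[p]) == 1]
    if len(sole) == 1:
        decisions[item] = ("allocate", sole[0])
    elif len(sole) > 1:
        decisions[item] = ("flag", sole)  # several people would be left with nothing
    else:
        decisions[item] = ("needs_further_rules", people)

print(decisions)
```

In this excerpt every person appears only once, so every contested item ends up flagged; the later rules would only come into play for people with several requests.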

What does support feature mean in result of function "term_stats()" from package "tm" in R and how is it different from count?

Running the following script produces the results below:
a <- c("Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. - Steve Jobs")
a_source <- VectorSource(a)
a_corpus <- VCorpus(a_source)
term_stats(a_corpus)
term count support
1 . 5 1
2 to 5 1
3 is 4 1
4 you 4 1
5 , 3 1
Support is the number of documents in which the term occurs; count is the total number of occurrences. You need both if you are doing tf-idf.
library(tm)      # VCorpus(), VectorSource()
library(corpus)  # term_stats() is provided by the corpus package
txt <- c("Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work.
And the only way to do great work is to love what you do.
If you haven't found it yet, keep looking. Don't settle.
As with all matters of the heart, you'll know when you find it.
- Steve Jobs")
term_stats(VCorpus(VectorSource(txt)))[1:5,]
term count support
. 5 1
to 5 1
is 4 1
#Split txt into 4 docs
txt_df <- data.frame( txt = c(
"Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work." ,
"And the only way to do great work is to love what you do." ,
"If you haven't found it yet, keep looking. Don't settle." ,
"As with all matters of the heart, you'll know when you find it. -
Steve Jobs"))
term_stats(VCorpus(VectorSource(txt_df$txt)))[1:6,]
term count support
. 5 4
you 4 4
, 3 3
the 3 3
to 5 2
is 4 2
Default is to sort by support.
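To make the difference between the two columns concrete, here is a small Python sketch that computes both numbers by hand. The three toy documents and the whitespace tokenisation are assumptions for illustration; this does not reproduce how term_stats() tokenises text.

```python
from collections import Counter

# Hypothetical toy corpus, one string per document
docs = ["to be or not to be", "to do is to be", "do be do"]

count = Counter()    # total occurrences across all documents
support = Counter()  # number of documents that contain the term
for doc in docs:
    tokens = doc.split()
    count.update(tokens)
    support.update(set(tokens))  # each document counts at most once per term

for term in ("to", "be", "do"):
    print(term, count[term], support[term])
```

Here "to" and "be" both occur 4 times, but "to" appears in only 2 documents while "be" appears in all 3, so they have the same count and different support.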

Calculate how much a point is worth based on played games

My problem is that I need to calculate how much a point is worth based on the number of games played.
If a team plays a match it can get 3 points for a win, 1 point for a tie and 0 points for a loss.
The problem is the following:
Team 1
Wins:8 Tie:2 Loss:3 Points:26 Played Games: 13
Team 2
Wins:8 Tie:3 Loss:4 Points:27 Played Games: 15
Here you can see that Team 2 has 1 more point than Team 1. But Team 2 has played 2 more matches and has a lower win % than Team 1. If you listed these two by points, Team 2 would get a higher "rating" than Team 1.
So how should the math look to make this fair, so that Team 1 has a better score than Team 2?
Just divide by the number of games to get the average points per game played.
Team1: 2.0 ppg
Team2: 1.8 ppg
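A short Python sketch of that calculation, using the 3/1/0 scoring from the question (the function name is just for illustration):

```python
def points_per_game(wins: int, ties: int, losses: int) -> float:
    """Average points per game with 3 points per win and 1 per tie."""
    points = 3 * wins + 1 * ties
    games = wins + ties + losses
    return points / games

print(round(points_per_game(8, 2, 3), 2))  # Team 1
print(round(points_per_game(8, 3, 4), 2))  # Team 2
```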
Okay, first of all, thanks for the help.
The solution to this is the following:
p/pg * p = Real points
p = Sum(points),
pg = Played games
So for the example up top the real points will be:
Team 1: 52
Team 2: 48.6
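The same "real points" formula as a small Python sketch, with p and pg passed in directly (the function name is hypothetical):

```python
def real_points(points: int, played_games: int) -> float:
    """p / pg * p: points per game, weighted again by total points."""
    return points / played_games * points

print(round(real_points(26, 13), 1))  # Team 1
print(round(real_points(27, 15), 1))  # Team 2
```

Note that this weights total points on top of the per-game average, so it still rewards teams for having played more games, just less strongly than raw points do.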

R Linear programming

Example 1.
Use R, in similar way as above, to solve the following problem:
The Handy-Dandy Company makes three types of kitchen appliances (A, B and C). To make each of these appliance types, just two inputs are required: labour and materials. Each unit of A made requires 7 hours of labour and 4 Kg of materials; for each unit of B made the requirements are 3 hours of labour and 4 Kg of materials, while for C the unit requirements are 6 hours of labour and 5 Kg of material. The company expects to make a profit of €40 for every unit of A sold, while the profit per unit for B and C are €20 and €30 respectively. Given that the company has available to it 150 hours of labour and 200 Kg of material each day, formulate this as a linear programming problem.
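One way to write the formulation down, reading the figures straight from the problem statement. The variable names x_A, x_B, x_C (units of each appliance made per day) are chosen here for illustration, and non-negativity is the usual assumption for production quantities:

```latex
\begin{aligned}
\text{maximise}\quad   & 40x_A + 20x_B + 30x_C \\
\text{subject to}\quad & 7x_A + 3x_B + 6x_C \le 150 \quad \text{(labour hours per day)} \\
                       & 4x_A + 4x_B + 5x_C \le 200 \quad \text{(Kg of materials per day)} \\
                       & x_A,\ x_B,\ x_C \ge 0
\end{aligned}
```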
x1 <- Rglpk_read_file("F:\\Linear_programming_R\\first.txt", type = "MathProg")
Rglpk_solve_LP(x1$objective, x1$constraints[[1]], x1$constraints[[2]], x1$constraints[[3]],
x1$bounds, x1$types, x1$maximum)
Can someone explain to me what the 1, 2 and 3 in brackets mean? Thanks
Those access elements of a list: x1$constraints is a list, and x1$constraints[[1]] is the first component of that list.
The $ operator accesses a named component of a list (a data.frame is one such list). Have a look at a tutorial on data types in R.

simple rank formula

I'm looking for a mathematical ranking formula.
Sample is
   2008  2009  2010
A     5     6     4
B     6     7     5
C     7     8     2
I want to add a rank column for each period code field
                      rank
   2008  2009  2010   2008  2009  2010
B     6     7     5      2     1     1
A     5     6     4      3     2     2
C     7     2     2      1     3     3
Please do not reply with methods that loop through the rows and columns, incrementing the rank value as they go; that's easy. I'm looking for a formula, much like finding the percent of total (item / total). I know I've seen this before but am having a tough time locating it.
Thanks in advance!
sort ((letters_col, number_col) descending by number_col)
As efficient as your sort alg.
Then number the rows, of course
Edit
I really got upset by your comment "please don't up vote this answer, sorting and loop is not what I'm asking for. i specifically stated this in my original question. " , and the negative votes, because, as you may have noted by the various answers received, it's basically correct.
However, I remained pondering where and how you may "have seen this before".
Well, I think I got the answer: You saw this in Excel.
Look at this:
This is the result after entering the formulas and sorting by column H.
It's exactly what you want ...
What are you using? If you're using Excel, you're looking for RANK(num, ref).
=RANK(B2,B$2:B$9)
I don't know of any programming language that has that built in, it would always require a loop of some form.
If you want the rank of a single element, you can do it in O(n) by looping through the elements, counting how many have value above the given element, and adding 1.
If you want the rank of all the elements, the best (and really only) way is to sort the elements. Anything else you do will be equivalent to sorting (there is no "formula")
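Both approaches can be sketched in a few lines of Python. The function names are made up for this example, and the tie handling (equal values share the best rank, as Excel's RANK does) is an assumption, since the question does not say how ties should be treated.

```python
def rank_of(value, values):
    """Rank of one value in O(n): 1 + the number of strictly larger values."""
    return 1 + sum(1 for v in values if v > value)

def rank_all(values):
    """Ranks for every element via one sort; tied values share the best rank."""
    order = sorted(values, reverse=True)
    first_position = {}
    for position, v in enumerate(order, start=1):
        first_position.setdefault(v, position)  # keep the first (best) position
    return [first_position[v] for v in values]

# The 2008 column from the question: A = 5, B = 6, C = 7
scores_2008 = {"A": 5, "B": 6, "C": 7}
print(rank_of(scores_2008["B"], scores_2008.values()))  # 2
print(rank_all(list(scores_2008.values())))             # [3, 2, 1]
```

Calling rank_of once per element would cost O(n^2) overall, which is why sorting is the better choice when every rank is needed.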
Are you using T-SQL? T-SQL RANK() may pull what you want.

Resources