Under the user generated posts on my site, I have an Amazon-like rating system:
Was this review helpful to you: Yes | No
If there are votes, I display the results above that line like so:
5 of 8 people found this reply helpful.
I would like to sort the posts based upon these rankings. If you were ranking from most helpful to least helpful, how would you order the following posts?
a) 1/1 = 100% helpful
b) 2/2 = 100% helpful
c) 999/1000 = 99.9% helpful
d) 3/4 = 75% helpful
e) 299/400 = 74.8% helpful
Clearly, it's not right to sort just on the percent helpful; somehow the total votes should be factored in. Is there a standard way of doing this?
UPDATE:
Using Charles' formulas to calculate the Agresti-Coull lower range and sorting on it, this is how the above examples would sort:
1) 999/1000 (99.9%) = 95% likely to fall in 'helpfulness' range of 99.2% to 100%
2) 299/400 (74.8%) = 95% likely to fall in 'helpfulness' range of 69.6% to 79.3%
3) 3/4 (75%) = 95% likely to fall in 'helpfulness' range of 24.7% to 97.5%
4) 2/2 (100%) = 95% likely to fall in 'helpfulness' range of 23.7% to 100%
5) 1/1 (100%) = 95% likely to fall in 'helpfulness' range of 13.3% to 100%
Intuitively, this feels right.
UPDATE 2:
From an application point of view, I don't want to be running these calculations every time I pull up a list of posts. I'm thinking I'll update and store the Agresti-Coull lower bound either on a regular, cron-driven schedule (updating only those posts which have received a vote since the last run) or whenever a new vote is received.
For each post, generate bounds on how helpful you expect it to be. I prefer to use the Agresti-Coull interval. Pseudocode:
float AgrestiCoullLower(int n, int k) {
    // float conf = 0.05; // 95% confidence interval
    float kappa = 2.24140273; // In general, kappa = ierfc(conf/2)*sqrt(2)
    float kest = k + kappa*kappa/2;   // adjusted number of helpful votes
    float nest = n + kappa*kappa;     // adjusted total number of votes
    float pest = kest/nest;           // adjusted proportion helpful
    float radius = kappa*sqrt(pest*(1-pest)/nest);
    return max(0, pest - radius);     // lower bound
    // upper bound is min(1, pest + radius)
}
Then take the lower end of the estimate and sort on this. So the 2/2 is (by Agresti-Coull) 95% likely to fall in the 'helpfulness' range 23.7% to 100%, so it sorts below the 999/1000 which has range 99.2% to 100% (since .237 < .992).
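For reference, here is roughly how that might look in R (my own sketch translating the pseudocode, not the answerer's code; the kappa value and the example vote counts are taken from above):
agresti_coull_lower <- function(n, k, kappa = 2.24140273) {
  kest <- k + kappa^2 / 2
  nest <- n + kappa^2
  pest <- kest / nest
  radius <- kappa * sqrt(pest * (1 - pest) / nest)
  max(0, pest - radius)   # lower bound; the upper bound would be min(1, pest + radius)
}
posts <- data.frame(helpful = c(1, 2, 999, 3, 299), total = c(1, 2, 1000, 4, 400))
posts$lower <- mapply(agresti_coull_lower, posts$total, posts$helpful)
posts[order(-posts$lower), ]   # 999/1000, 299/400, 3/4, 2/2, 1/1, matching the update above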
Edit: Since some people seem to have found this helpful (ha ha), let me note that the algorithm can be tweaked based on how confident/risk-averse you want to be. The less confidence you need, the more willing you will be to abandon the 'proven' (high-vote) reviews for the untested but high-scoring reviews. A 90% confidence interval gives kappa = 1.95996398, an 85% confidence interval gives 1.78046434, a 75% confidence interval gives 1.53412054, and the all-caution-to-the-wind 50% confidence interval gives 1.15034938.
The 50% confidence interval gives
1) 999/1000 (99.7%) = 50% likely to fall in 'helpfulness' range of 99.7% to 100%
2) 299/400 (72.2%) = 50% likely to fall in 'helpfulness' range of 72.2% to 77.2%
3) 2/2 (54.9%) = 50% likely to fall in 'helpfulness' range of 54.9% to 100%
4) 3/4 (45.7%) = 50% likely to fall in 'helpfulness' range of 45.7% to 91.9%
5) 1/1 (37.5%) = 50% likely to fall in 'helpfulness' range of 37.5% to 100%
which isn't that different overall, but it does prefer the 2/2 to the safety of the 3/4.
This question is probably better asked on http://stats.stackexchange.com .
I guess you still want to order by increasing 'helpfulness'.
If you want to know how precise a given number is, the simplest way is to use the square root of the variance of the Binomial distribution with n equal to the total number of responses and p the fraction of responses which were 'helpful'.
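For example, a quick sketch of that idea in R (my own illustration; dividing by n to express the precision of the 'helpful' fraction is my extrapolation of the suggestion):
binom_sd <- function(n, p) sqrt(n * p * (1 - p))   # SD of the count of helpful votes
binom_sd(4, 3/4) / 4           # ~0.217: the 75% estimate from 4 votes is imprecise
binom_sd(400, 299/400) / 400   # ~0.022: the 74.8% estimate from 400 votes is much tighter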
A very simple solution would be to ignore everything with less than a cut-off amount of votes, and then sort by percentage.
For example (requiring at least five votes):
1. 99.9% (1000 votes)
2. 74.8% (400 votes)
3-5. waiting for five votes
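A possible sketch of that rule in R (illustrative only; the cut-off of five votes and the vote counts are the ones used above):
posts <- data.frame(helpful = c(1, 2, 999, 3, 299), total = c(1, 2, 1000, 4, 400))
min_votes <- 5
posts$pct <- posts$helpful / posts$total
ranked <- posts[posts$total >= min_votes, ]
ranked[order(-ranked$pct), ]        # 999/1000 first, then 299/400
posts[posts$total < min_votes, ]    # the rest are still waiting for five votes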
It depends on the expected rate of positive feedback and on how many people vote on average.
If, like in the example you give, you will sometimes have 5 or 10 people voting and other times 1,000, then I would suggest the Wilson midpoint:
(x + z^2/2) / (n + z^2), the midpoint of the adjusted Wald interval (Wilson score)
where:
n = Sum(all_votes),
x = Sum(positive_votes),
z = 1.96 (a fixed value, the z-score for 95% confidence)
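Applied to the example posts, a short R sketch of that midpoint (my own illustration, using z = 1.96 as above):
wilson_midpoint <- function(positive, n, z = 1.96) {
  (positive + z^2 / 2) / (n + z^2)
}
wilson_midpoint(c(1, 2, 999, 3, 299), c(1, 2, 1000, 4, 400))
# ~0.60, 0.67, 0.997, 0.63, 0.75, which sorts 999/1000, 299/400, 2/2, 3/4, 1/1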
Using an R script, solve the following:
An expert on process control states that he is 95% confident that the new production process will save between $26 and $38 per unit with savings values around $32 more likely.
If you were to model this expert's opinion using a normal distribution (by applying the empirical rule), what standard deviation would you use for your normal distribution? (Round your answer to 1 decimal place.)
It appears that whoever wrote this problem is confused and doesn't know whether they are asking a sample-mean confidence-interval problem ("95% confident") or a simple population normal-distribution problem.
Let's reason through how we can or can't solve this, and we will discover some problems with the wording of this problem as we go:
He says he is 95% confident that ... [ignore everything else for now and assume that this is a confidence-interval-of-the-mean problem ... and we will see why that is wrong] ... First let's calculate the z-score on the normal distribution which corresponds to a 0.95 cumulative probability. You said you want to do this in R, so use qnorm():
> qnorm(.95)
[1] 1.644854
Now, we know that the money saved is between $26 and $38. If his 95% confidence implies a sample mean, then $26 is 1.644854 standard errors below that sample mean, and $38 is 1.644854 standard errors above it (if this were a confidence interval problem). Their sample mean is presumably $32.
Let's say we try to solve for the standard deviation. The standard error is:
StDev / sqrt(sample size)
and the confidence interval is:
lower bound: 32 - 1.644854 * StDev / sqrt(sample size);
upper bound: 32 + 1.644854 * StDev / sqrt(sample size) # we will use this below
We could attempt to solve algebraically for StDev by putting the upper bound formula on the left side of the = sign and putting 38, which is the upper bound, on the right side:
32 + (1.644854 * StDev / sqrt(sample size)) = $38 ... Now solve for StDev
StDev = sqrt(sample size) * (38 - 32) / 1.644854 ... if I didn't screw up my mental algebra at midnight without paper in hand...
There's a problem here that this rhetorical exercise was meant to point out: We still have 2 variables. The problem you posted simply didn't tell us enough information to solve this with the assumption this was a confidence interval from a sample. We are kind of out of luck if this is where they were going with this.
It looks like the 95% confidence clause (and absence of a mention of a sample mean) is meant to throw you off, but in reality, it just makes the person asking you this question appear to be confused as to what question they are asking you.
If you re-frame the question and assume that (1) the '95% confident' clause is junk information, (2) we are talking about the probability that an individual observation falls at or below a given value, not that we are 95% confident about the average observation, and (3) whoever wrote the question does not understand the proper usage of the phrase '95% confident', was exhausted when they wrote it, or you mis-transcribed the problem ... then the question would be worded like this: "We know that 95% of all customers saved no more than $38 and 5% of customers saved $26 or less." In this case we could drop the standard error term altogether and we would only be worried about the standard deviation and mean of the population:
The population mean then is 32
the mean + 1.644854 standard deviations is 38 (95% of customers save no more than this)
38 - 32 = 6 (this is equal to 1.644854 standard deviations). Algebraically that's written:
6 = 1.644854 * stdev
divide both sides by 1.644854:
6/1.644854 = StDev
StDev = 3.64774
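Or, doing that division directly in R as a one-line check:
6 / qnorm(0.95)   # 3.64774 (approx), the value used below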
Let's verify that we got this right:
> qnorm(.95,mean=32,sd=3.64774) # Verify in R that the stdev we calculated is correct: .95 cumulative probability, mean is 32 and we assert that the StDev is 3.64774. We got:
[1] 38
$38 or less is what 95% would get. This seems right.
> qnorm(.05,mean=32,sd=3.64774)
[1] 26
$26 or less is what the 5% of customers who saved the least got. This seems right also.
Summary:
The question you posted doesn't make any sense. It is either incomplete or mis-transcribed, or whoever wrote it was a little confused.
If you ignore the 95% confidence clause and reframe the question to make a guess to compensate for how ambiguous it was, then the answer is: The standard deviation is 3.6.
According to the empirical rule of the normal distribution:
68% of the data falls within one standard deviation, 95% within two standard deviations, and 99.7% within three standard deviations of the mean.
Since the problem says 95% confident, the data will fall within 2 standard deviations of the mean.
So, min_value: 26 = mean - 2*standard_deviation
or, max_value: 38 = mean + 2*standard_deviation
Given mean = 32, solving either of the above two equations gives standard_deviation = 3.0.
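A one-line check of that in R (my own sketch, using exactly 2 standard deviations as the empirical rule does):
sd_emp <- (38 - 32) / 2      # 3.0
32 + c(-2, 2) * sd_emp       # 26 38, the stated range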
I used this code for lower limit = 2 and upper limit = 3, and it worked correctly for lower values of the limits, but it doesn't work for larger numbers unless I add 0.5 to the sd:
> f <- function(lwr, upr){
+   c("mean" = (upr + lwr)/2,
+     "stddev" = (upr - lwr)/4,
+     "sdRound" = round((upr - lwr)/4, 1)) }
> f(2,3)
With this, I get the answers as:
mean stddev sdRound
2.50 0.25 0.20
I can't use the value as rounded in R: round() rounds halves to even, so round(0.25, 1) returns 0.2, whereas the correct answer is 0.3 (0.25 rounded half-up).
When I plug this sd = 0.3 into the code below, I get the correct upper limit (and also the correct lower limit):
> upperlimit = round(qnorm(0.95, mean=2.5, sd=0.3),0)
> lowerlimit = round(qnorm(0.05, mean=2.5, sd=0.3))
upperlimit =3
lowerlimit =2
This also works for f(6,9)
I am building an index of user qualities, defined as a sum of (often) correlated continuous variables representing user activity. The index is well calibrated and serves the purpose of my analysis, but it is tricky to communicate to my co-workers, particularly since outlier activities cause extremely tenacious users to score very highly on the activity index.
For 97% of users, the index is distributed near-normally between 0 and 100, with a right tail of 3% of hyper-active users with an index > 100. Index-values beyond 200 should be extremely rare but are theoretically possible.
I'm looking to scale the tail back into a 0-100 span, but not linearly, since I would like the 3% tail to be represented as small variances within the top of the 0-100 range.
What I'm looking for is a non-linear formula to scale my index so that the lower tier of the unscaled index stays close to the scaled one, high index values diverge, and the scaled values never reach 100 as the index goes towards infinity; that is, f(0) = 0, but when x = 140, f(x) ≈ 99, or something similar.
I'll implement the scaling in R, Python and BigQuery.
There are lots of ways to do this: take any function with the right shape and tweak it to your needs.
One family of functions with the right shape is
f(x) = x/pow(1 + pow(x/100, n), 1/n)
You can vary the parameter n to adjust the shape: increasing n pushes f(100) closer to 100. With n=5 you get something that looks pretty close to your drawing
f(x) = x/pow(1 + pow(x/100, 5), 0.2)
Another option is taking the hyperbolic tangent function tanh which you can of course tweak in similar ways:
f(x) = 100*pow(tanh(pow(x/100, n)), 1/n)
here's the curve with n=2:
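Since you mention implementing this in R, here is a transcription of the two families above (my own sketch; the formulas are the ones given in this answer, with n left as a tuning parameter and rounded outputs shown as comments):
scale_power <- function(x, n = 5) x / (1 + (x / 100)^n)^(1 / n)
scale_tanh  <- function(x, n = 2) 100 * tanh((x / 100)^n)^(1 / n)
scale_power(c(0, 50, 100, 140, 200))   # ~0, 49.7, 87.1, 96.6, 99.4
scale_tanh(c(0, 50, 100, 140, 200))    # ~0, 49.5, 87.3, 98.0, 99.97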
Main question: Suppose you have a discrete, finite data set $d$. Then the command summary(d) returns the Min, 1st quartile, Median, mean, 3rd quartile, and max. My question is: what formula does R use to compute the 1st quartile?
Background: My data set was d=c(1,2,3,3,4,9). summary(d) returns 2.25 as the first quartile. Now, one way to compute the first quartile is to choose a value q1 such that 25% of the data set is less than or equal to q1. Clearly, this is not what R is using. So, I was wondering, what formula does R use to compute the first quartile?
Google searches on this topic have left me even more puzzled, and I couldn't find the formula that R uses. Typing help(summary) in R wasn't helpful to me either.
General discussion:
There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.
As a result, the wide variety of packages between them use many different definitions.
The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions for the quantile function, and mentions which (of a number of common) packages use which definitions. Its Introduction says:
the sample quantiles that are used in statistical packages are all based on one or two order statistics, and can be written as
$\hat{Q}_i(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j+1)}$, where $\frac{j-m}{n} \leq p < \frac{j-m+1}{n}$ (1)
for some $m \in \mathbb{R}$ and $0 \leq \gamma \leq 1$.
Which is to say, in general, the sample quantiles can be written as some kind of weighted average of two adjacent order statistics (though it may be that there's only weight on one of them).
In R:
In particular, R offers all nine definitions mentioned in Hyndman & Fan (with $7$ as the default). From Hyndman & Fan we see:
Definition 7. Gumbel (1939) also considered the modal position $p_k = \text{mode}\,F(X_{(k)}) = (k-1)/(n-1)$. One nice property is that the vertices of $Q_7(p)$ divide the range into $n-1$ intervals, and exactly $100p\%$ of the intervals lie to the left of $Q_7(p)$ and $100(1-p)\%$ of the intervals lie to the right of $Q_7(p)$.
What does this mean? Consider n=9. Then for (k-1)/(n-1) = 0.25, you need k = 1+(9-1)/4 = 3. That is, the lower quartile is the 3rd observation of 9.
We can see that in R:
quantile(1:9)
0% 25% 50% 75% 100%
1 3 5 7 9
For its behavior when n is not of the form 4k+1, the easiest thing to do is try it:
> quantile(1:10)
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
> quantile(1:11)
0% 25% 50% 75% 100%
1.0 3.5 6.0 8.5 11.0
> quantile(1:12)
0% 25% 50% 75% 100%
1.00 3.75 6.50 9.25 12.00
When k isn't integer, it's taking a weighted average of the adjacent order statistics, in proportion to the fraction it lies between them (that is, it does linear interpolation).
The nice thing is that on average you get 3 times as many observations above the first quartile as you get below. So for 9 observations, for example, you get 6 above and 2 below the third observation, which divides them into the ratio 3:1.
What's happening with your sample data
You have d=c(1,2,3,3,4,9), so n is 6. You need (k-1)/(n-1) to be 0.25, so k = 1 + 5/4 = 2.25. That is, it takes 25% of the way between the second and third observation (which coincidentally are themselves 2 and 3), so the lower quartile is 2+0.25*(3-2) = 2.25.
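Spelled out in R, that calculation looks like this (a quick sketch checking the arithmetic above, not code from the original answer):
d <- c(1, 2, 3, 3, 4, 9)
x <- sort(d)
n <- length(d)
h <- 1 + (n - 1) * 0.25           # position 2.25 between order statistics
j <- floor(h); g <- h - j         # j = 2, fractional part g = 0.25
(1 - g) * x[j] + g * x[j + 1]     # 2.25
quantile(d, 0.25, type = 7)       # 2.25, which is what summary(d) reports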
Under the hood: some R details:
When you call summary on a data frame, this results in summary.data.frame being applied to the data frame (i.e. the relevant summary for the class you called it on). Its existence is mentioned in the help on summary.
The summary.data.frame function (ultimately -- via summary.default applied to each column) calls quantile to compute quartiles (you won't see this in the help, unfortunately, since ?summary.data.frame simply takes you to the summary help and that doesn't give you any details on what happens when summary is applied to a numeric vector -- this is one of those really bad spots in the help).
So ?quantile (or help(quantile)) describes what R does.
Here are two things it says (based directly off Hyndman & Fan). First, it gives general information:
All sample quantiles are defined as weighted averages of consecutive order
statistics. Sample quantiles of type i are defined by:
Q[i](p) = (1 - γ) x[j] + γ x[j+1],
where 1 ≤ i ≤ 9, (j-m)/n ≤ p < (j-m+1)/n, x[j] is the jth order statistic,
n is the sample size, the value of γ is a function of j = floor(np + m) and
g = np + m - j, and m is a constant determined by the sample quantile type.
Second, there's specific information about method 7:
Type 7: m = 1 - p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.
Hopefully the explanation I gave earlier helps to make more sense of what this is saying. The help on quantile pretty much just quotes Hyndman & Fan as far as definitions go, and its behavior is pretty simple.
Reference:
[1]: Rob J. Hyndman and Yanan Fan (1996),
"Sample Quantiles in Statistical Packages,"
The American Statistician, Vol. 50, No. 4. (Nov.), pp. 361-365
Also see the discussion here.
Given the results for a simple A / B test...
A B
clicked 8 60
ignored 192 1940
(i.e. a conversion rate of 4% for A and 3% for B)
... a fisher test in R quite rightly says there's no significant difference
> fisher.test(data.frame(A=c(8,192), B=c(60,1940)))
...
p-value = 0.3933
...
But what function is available in R to tell me how much I need to increase my sample size to get to a p-value of say 0.05?
I could just increase the A values (in their proportion) until I get there, but there's got to be a better way? Perhaps pwr.2p2n.test [1] is somehow usable?
[1] http://rss.acs.unt.edu/Rdoc/library/pwr/html/pwr.2p2n.test.html
power.prop.test() should do this for you. In order to get the math to work I converted your 'ignored' data to impressions by summing up your columns.
> power.prop.test(p1=8/200, p2=60/2000, power=0.8, sig.level=0.05)
Two-sample comparison of proportions power calculation
n = 5300.739
p1 = 0.04
p2 = 0.03
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
That gives 5301, which is for each group, so your sample size needs to be 10600. Subtracting out the 2200 that have already run, you have 8400 "tests" to go.
In this case:
sig.level is the same as your p-value.
power is the likelihood of finding significant results that exist within your sample. This is somewhat arbitrary, 80% is a common choice. Note that choosing 80% means that 20% of the time you won't find significance when you should. Increasing the power means you'll need a larger sample size to reach your desired significance level.
If you wanted to decide how much longer it will take to reach significance, divide 8400 by the number of impressions per day. That can help determine if it's worthwhile to continue the test.
You can also use this function to determine required sample size before testing begins. There's a nice write-up describing this on the 37 Signals blog.
This is a native R function, so you won't need to add or load any packages. Other than that, I can't say how similar this is to pwr.2p2n.test().
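For what it's worth, a rough sketch of the pwr route (assuming the pwr package is installed; this uses pwr.2p.test, the equal-sample-size relative of pwr.2p2n.test, together with the package's arcsine effect size, and lands in the same ballpark):
library(pwr)
h <- ES.h(8 / 200, 60 / 2000)                      # arcsine-transformed effect size
pwr.2p.test(h = h, sig.level = 0.05, power = 0.8)
# n per group comes out close to power.prop.test's 5301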
Let's say that I have a diminishing value that should be portrayed both on a monthly basis and on a weekly basis.
For example, I know that the value, say 100 000, diminishes by 30%/year, which when I calculate (by normal "periodic compound" formulas) is 2.21%/month and 0.51%/week.
However, looking at the results from these calculations (calculating for an entire year), I do not get the same end value. Only if I calculate it as an "interest" (i.e. the percentage is ADDED to the value, NOT taken away) do I get matching values from the weekly and monthly calculations.
What is the correct formula for calculating this "compound taxation" problem?
I don't know if I fully understand your question.
You cannot calculate diminishing interest the way you did.
If your value (100 000) diminishes by 30%/year, this means that at the end of year 1 your value is 70 000.
The way you calculated your compounding would work only if diminishing by 30% meant dividing by 1.3 (i.e. 100000/1.3).
Your mistake:
You made your calculation this way:
(1+x)^12 - 1 = 30%, so x = 0.0221 and the monthly interest is 2.21%
(1+x)^52 - 1 = 30%, so x = 0.0051 and the weekly interest is 0.51%
But what you should have done is:
(1-x)^12 = 1 - 30%, so x = 0.0292 and the monthly interest is 2.92%
(1-x)^52 = 1 - 30%, so x = 0.0068 and the weekly interest is 0.68%
You cannot calculate the compound interest as if it was increasing 30% when it's decreasing 30%.
It's easy to understand why the compound rate for an increase is smaller than the one for a decrease:
Example:
Let's say your investment makes 30% per year.
At the end of the first month you will have more money, and therefore you're investing more, so you need a smaller return to make as much money as in the first month.
Therefore, for increasing interest, the compound rate i = 2.21 is smaller than 30/12 = 2.5.
The same reasoning applies for the decrease: i = 2.92 > 30/12 = 2.5.
Note:
(1+x)^12 - 1 = 30% is not equivalent to (1-x)^12 = 1 - 30%.
A decrease cannot be treated as the inverse of an increase:
following what you did, adding 10% to one and then taking away 10% from the result would return one:
(1+10%)/(1+10%) = 1
The way it's actually calculated won't give the same result: (1+10%)*(1-10%) = 0.99
Hope I understood your question and that it helps.
Engaging psychic debugging...
diminishes by 30%/year. Which when I calculate (by normal "periodic compound" formulas) is 2.21%/month and 0.51%/week.
You are doing an inappropriate calculation.
You are correct in saying that 30% annual growth is approx 2.21% monthly growth. The reason for this is because 30% annual growth is expressed as multiplication by 1.30 (since 100% + 30% = 130%, or 1.30), and making this monthly is:
1.30 ^ (1/12) = 1.0221 (approx)
However, it does not follow from this that 30% annual shrinkage is approx 2.21% monthly shrinkage. To work out the monthly shrinkage we must note that 30% shrinkage is multiplication by 0.70 (since 100% - 30% = 70%, or 0.70), and make this monthly in the same way:
0.70 ^ (1/12) = 0.9707 (approx)
Multiplication by 0.9707 is monthly shrinkage of 2.929% (approx).
Hopefully this will give you the tools you need to correct your calculations.
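A quick R check of the arithmetic in these two answers (my own sketch, reusing the numbers from the question):
monthly <- 1 - 0.70^(1 / 12)   # ~0.0293, i.e. 2.93% shrinkage per month
weekly  <- 1 - 0.70^(1 / 52)   # ~0.0068, i.e. 0.68% shrinkage per week
100000 * (1 - monthly)^12      # 70000
100000 * (1 - weekly)^52       # 70000
# The naive rates from the question agree neither with each other nor with 70000:
100000 * (1 - 0.0221)^12       # ~76477
100000 * (1 - 0.0051)^52       # ~76653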