Main question: Suppose you have a discrete, finite data set $d$. Then the command summary(d) returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. My question is: what formula does R use to compute the 1st quartile?
Background: My data set was: d=c(1,2,3,3,4,9). summary(d) returns 2.25 as the first quartile. Now, one way to compute the first quartile is to choose a value q1 such that 25% of the data set is less than or equal to q1. Clearly, this is not what R is using. So I was wondering: what formula does R use to compute the first quartile?
Google searches on this topic have left me even more puzzled, and I couldn't find a formula that R uses. Typing help(summary) in R wasn't helpful to me either.
General discussion:
There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.
As a result, the wide variety of packages between them use many different definitions.
The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions for the quantile function, and notes which of a number of common packages use which definitions. Its Introduction says:
the sample quantiles that are used in statistical
packages are all based on one or two order statistics, and
can be written as
$$\hat{Q}_i(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j+1)}\,, \quad \text{where } \frac{j-m}{n}\leq p< \frac{j-m+1}{n} \qquad (1)$$
for some $m\in \mathbb{R}$ and $0\leq\gamma\leq 1$.
Which is to say, in general, the sample quantiles can be written as some kind of weighted average of two adjacent order statistics (though it may be that there's only weight on one of them).
In R:
In particular, R offers all nine definitions mentioned in Hyndman & Fan, with type $7$ as the default. From Hyndman & Fan we see:
Definition 7. Gumbel (1939) also considered the modal position
$p_k = \text{mode}\,F(X_{(k)}) = (k-1)/(n-1)$. One nice property is that the vertices of $Q_7(p)$ divide the range into $n-1$ intervals, and exactly $100p\%$ of the intervals lie to the left of $Q_7(p)$ and $100(1-p)\%$ of the intervals lie to the right of $Q_7(p)$.
What does this mean? Consider n=9. Then for (k-1)/(n-1) = 0.25, you need k = 1+(9-1)/4 = 3. That is, the lower quartile is the 3rd observation of 9.
We can see that in R:
quantile(1:9)
0% 25% 50% 75% 100%
1 3 5 7 9
For its behavior when n is not of the form 4k+1, the easiest thing to do is try it:
> quantile(1:10)
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
> quantile(1:11)
0% 25% 50% 75% 100%
1.0 3.5 6.0 8.5 11.0
> quantile(1:12)
0% 25% 50% 75% 100%
1.00 3.75 6.50 9.25 12.00
When k isn't an integer, it takes a weighted average of the adjacent order statistics, in proportion to the fraction of the way it lies between them (that is, it does linear interpolation).
The nice thing is that, on average, you get three times as many observations above the first quartile as below it. For 9 observations, for example, you get 6 above and 2 below the third observation, which divides them in the ratio 3:1.
What's happening with your sample data:
You have d=c(1,2,3,3,4,9), so n is 6. You need (k-1)/(n-1) to be 0.25, so k = 1 + 5/4 = 2.25. That is, the lower quartile lies 25% of the way between the second and third observations (which happen to be 2 and 3), so the lower quartile is 2 + 0.25*(3-2) = 2.25.
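We can verify that interpolation directly in R (quantile defaults to type 7, which is what summary uses):
> d <- c(1, 2, 3, 3, 4, 9)
> quantile(d, 0.25)    # default type = 7
 25% 
2.25 
> # by hand: h = (n - 1) * p + 1 = 5 * 0.25 + 1 = 2.25
> sort(d)[2] + 0.25 * (sort(d)[3] - sort(d)[2])
[1] 2.25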
Under the hood: some R details:
When you call summary on a data frame, this results in summary.data.frame being applied to the data frame (i.e. the relevant summary for the class you called it on). Its existence is mentioned in the help on summary.
The summary.data.frame function (ultimately, via summary.default applied to each column) calls quantile to compute quartiles. Unfortunately you won't see this in the help, since ?summary.data.frame simply takes you to the summary help, and that doesn't give you any details on what happens when summary is applied to a numeric vector -- this is one of those really bad spots in the help.
So ?quantile (or help(quantile)) describes what R does.
Here are two things it says (based directly off Hyndman & Fan). First, it gives general information:
All sample quantiles are defined as weighted averages of consecutive order
statistics. Sample quantiles of type i are defined by:
Q[i](p) = (1 - γ) x[j] + γ x[j+1],
where 1 ≤ i ≤ 9, (j-m)/n ≤ p < (j-m+1)/n, x[j] is the jth order statistic,
n is the sample size, the value of γ is a function of j = floor(np + m) and
g = np + m - j, and m is a constant determined by the sample quantile type.
Second, there's specific information about method 7:
Type 7
m = 1-p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.
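To connect this to the earlier example: for type 7, m = 1-p, so with n = 6 and p = 0.25 we have np + m = 1.5 + 0.75 = 2.25, giving j = 2 and g = 0.25. For the continuous types (4 through 9), γ = g, so Q[7](0.25) = (1 - 0.25) x[2] + 0.25 x[3] = 0.75*2 + 0.25*3 = 2.25, exactly as in the sample-data calculation above.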
Hopefully the explanation I gave earlier helps to make more sense of what this is saying. The help on quantile pretty much just quotes Hyndman & Fan as far as definitions go, and its behavior is pretty simple.
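If you're curious how much the choice of definition matters, you can compare the lower quartile under all nine types for the same vector (a quick sketch; only the type argument changes):
> d <- c(1, 2, 3, 3, 4, 9)
> sapply(1:9, function(type) quantile(d, 0.25, type = type, names = FALSE))
For this vector the nine definitions do not all agree, which is exactly why the help page spells each one out.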
Reference:
[1]: Rob J. Hyndman and Yanan Fan (1996),
"Sample Quantiles in Statistical Packages,"
The American Statistician, Vol. 50, No. 4 (Nov.), pp. 361-365.
Also see the discussion here.
Related
For an analysis I need to compute the "F-pseudosigma", also called the "pseudo standard deviation". I tried to see whether it's in any R package, but couldn't find it myself.
There isn't much info on it to begin with.
Do any of you know a package that contains it, or a function in some package that calculates it?
I have to admit that I haven't heard about F-pseudo sigma (or pseudo sigma) before; but a bit of research suggests that it is simply defined as the scaled difference between the third and first quartile.
That can be easily translated into a custom R function
fpseudosig <- function(x) unname(diff(quantile(x, c(0.25, 0.75))) / 1.35)
For example, let's generate some random data x ~ N(0, 1)
set.seed(2018)
x <- rnorm(100)
Then
fpseudosig(x)
#[1] 0.9703053
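Since IQR() uses the same default (type 7) quartiles as quantile(), an equivalent and perhaps more readable formulation is simply the interquartile range divided by 1.35 (or 1.349 if you prefer the more precise constant mentioned in the USGS reference below):
# equivalent: the interquartile range scaled by 1.35
fpseudosig2 <- function(x) IQR(x) / 1.35
fpseudosig2(x)   # same value as fpseudosig(x) above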
References
(in no particular order)
Irwin, Exploratory Data Analysis for Beginners: "Instead of using the standard deviation in an RSD calculation, one might consider using the sample-data deviation (F-pseudosigma). This is a nonparametric statistic analogous to the standard deviation that is calculated by using the 25th and 75th percentiles in a data set. It is resistant to the effect of extreme outliers."
https://bqs.usgs.gov/srs/SRS_Spr04/statrate.htm: "The F-pseudosigma is calculated by dividing the fourth-spread (analogous to interquartile range) by 1.349; therefore the smaller the F-pseudosigma the more precise the determinations. The 1.349 value is derived from the number of standard deviations that encompasses 50% of the data."
http://mkseo.pe.kr/stats/?p=5: "Simply put, given the first quartile H1 and the third quartile H3, pseudo sigma is (H3-H1)/1.35. Why? It’s because H1= μ – 0.675σ and H3 = μ + 0.675σ if X ∼N. Therefore, H3-H1=1.35σ, resulting in σ = (H3-H1)/1.35. We call H3-H1 as IQR(Inter Quartile Range)."
I've trained a model to predict a certain variable. When I now use this model to predict said variable and compare these predictions to the actual values, I get the two following distributions.
The corresponding R Data Frame looks as follows:
x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted
These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.
Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.
I've tried just plotting the difference predicted - actual, and also the difference relative to the mean of the respective group. However, neither approach has produced my desired result. With the plotted distribution, it is especially important to be able to make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?
Let's consider the two distributions as df_actual, df_predicted, then calculate
# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)
Then find the relative % difference by :
x_diff = mean((df_diff$x - df_actual$x) / df_actual$x) * 100
y_diff = mean((df_diff$y - df_actual$y) / df_actual$y) * 100
This will give you the % difference (+/-) in x as well as y. This is my opinion; also follow this thread for displaying and measuring the area between two distribution curves.
I hope this helps.
ParthChaudhary is right - rather than subtracting the distributions as wholes, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, otherwise the actual - predicted differences will be overshadowed by the variance of the actual (and predicted) values alone. I.e., if you have something like:
x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...
you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
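A minimal sketch of that idea (the column names x, y, type follow the hypothetical example above; the exact names in your data frame may differ): merge on the identifier, compute the relative difference per pair, and read the -X%/+Y% bounds off the quantiles of that distribution.
# pair up actual and predicted values by their identifier x
paired <- merge(df[df$type == "actual", ],
                df[df$type == "predicted", ],
                by = "x", suffixes = c(".actual", ".predicted"))

# relative error of each prediction, in percent
rel_diff <- 100 * (paired$y.predicted - paired$y.actual) / paired$y.actual

# density of the relative errors
plot(density(rel_diff), main = "Relative prediction error (%)")

# "50% of the predictions are within -X% and +Y% of the actual values"
quantile(rel_diff, c(0.25, 0.75))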
To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test
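For example (using the same hypothetical data frame df as above):
# two-sample Kolmogorov-Smirnov test on the two sets of values
ks.test(df$y[df$type == "actual"], df$y[df$type == "predicted"])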
I was working on statistics using R. Before doing this with an R program, I did it manually. So here is the problem.
A sample of 300 TV viewers were asked to rate the overall quality of television shows from 0 (terrible) to 100 (the best). A histogram was constructed from the results, and it was noted that
it was mound-shaped and symmetric, with a sample mean of 65 and a sample standard
deviation of 8. Approximately what proportion of ratings would be above 81?
I have answered it manually with this:
Pr(X>81)=Pr(Z>(81-65)/8)=Pr(Z>2)=0.0227
So the proportion is 0.023 or 2.3%
My trouble is with how I can do this in R. I have tried using pnorm(p=.., mean=.., sd=..) but didn't get a result matching my manual calculation.
Thank you so much for the answer
You identified the correct function.
The help on pnorm gives the list of arguments:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
with the explanation for the arguments:
x, q: vector of quantiles.
mean: vector of means.
sd: vector of standard deviations.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X <= x]
otherwise, P[X > x].
Under "Value:" it says
... ‘pnorm’ gives the distribution function,
So that covers everything. If you pass the value whose lower-tail area you want as q, together with the correct mean and sd values, you will get the area below it. If you want the area above, add lower.tail=FALSE.
Like so:
pnorm(81,65,8) # area to left
[1] 0.9772499
pnorm(81,65,8,lower.tail=FALSE) # area to right ... which is what you want
[1] 0.02275013
(this way is more accurate than subtracting the first thing from 1 when you get into the far upper tail)
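The same answer also drops out if you standardise first, which matches your manual Z calculation:
pnorm((81 - 65) / 8, lower.tail=FALSE) # same as pnorm(2, lower.tail=FALSE)
[1] 0.02275013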
I'm trying to calculate the AUC for a large-ish data set and am having trouble finding an implementation that both handles values that aren't just 0's or 1's and works reasonably quickly.
So far I've tried the ROCR package, but it only handles 0's and 1's, and the pROC package will give me an answer but could take 5-10 minutes to calculate 1 million rows.
As a note, all of my values fall between 0 and 1 but are not necessarily 1 or 0.
EDIT: both the answers and predictions fall between 0 and 1.
Any suggestions?
EDIT2:
ROCR can deal with situations like this:
Ex.1
actual prediction
1 0
1 1
0 1
0 1
1 0
or like this:
Ex.2
actual prediction
1 .25
1 .1
0 .9
0 .01
1 .88
but NOT situations like this:
Ex.3
actual prediction
.2 .25
.6 .1
.98 .9
.05 .01
.72 .88
pROC can deal with Ex.3 but it takes a very long time to compute. I'm hoping that there's a faster implementation for a situation like Ex.3.
So far I've tried the ROCR package, but it only handles 0's and 1's
Are you talking about the reference class memberships or the predicted class memberships?
The latter can be between 0 and 1 in ROCR, have a look at its example data set ROCR.simple.
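For instance, with 0/1 labels and continuous predicted memberships, a standard ROCR AUC calculation looks like this (a minimal sketch using ROCR's bundled example data):
library(ROCR)

data(ROCR.simple)   # continuous predictions, 0/1 labels
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
performance(pred, measure = "auc")@y.values[[1]]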
If your reference is in [0, 1], you could have a look at (disclaimer: my) package softclassval. You'd have to construct the ROC/AUC from sensitivity and specificity calculations, though. So unless you think of an optimized algorithm (as the ROCR developers did), it'll probably take a long time, too. In that case you'll also have to think about what exactly sensitivity and specificity should mean, as this is ambiguous with reference memberships in (0, 1).
Update after clarification of the question
You need to be aware that summarizing the reference (actual) values in this way loses information. E.g., if you have actual = 0.5 and prediction = 0.8, what is that supposed to mean? Suppose these values were really actual = 5/10 and prediction = 8/10, i.e. summaries of 10 underlying yes/no tests.
By summarizing the 10 tests into two numbers, you lose the information about whether the same 5 out of the 10 were meant or not. Without this, actual = 5/10 and prediction = 8/10 is consistent with anything between 30% and 70% correct recognition!
There is an illustration discussing the sensitivity (i.e. correct recognition, e.g. of click-through) in a poster on this topic; you can find the whole poster and two presentations discussing such issues at softclassval.r-forge.r-project.org, section "About softclassval".
Following on from these thoughts, weighted versions of the mean absolute, mean squared, root mean squared, etc. errors can be used as well.
However, all those different ways to express the same performance characteristic of the model (e.g. sensitivity = % correct recognition of actual click-through events) do have different meanings, and while they coincide with the usual calculation in unambiguous reference and prediction situations, they will react differently to ambiguous reference / partial reference class membership.
Note also that, because you use continuous values in [0, 1] for both reference/actual and prediction, the whole test will be condensed into one point (not a line!) in the ROC or specificity-sensitivity plot.
Bottom line: the grouping of the data gets you in trouble here. So if you could somehow get the information on the single clicks, go and get it!
Can you use other error measures for assessing method performance (e.g. Mean Absolute Error, Root Mean Square Error)?
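For example (assuming two numeric vectors actual and predicted of equal length, both in [0, 1]; the names are illustrative):
mae  <- mean(abs(predicted - actual))        # Mean Absolute Error
rmse <- sqrt(mean((predicted - actual)^2))   # Root Mean Square Error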
This post might also help you out, but if you have different numbers of classes for observed and predicted values, then you might run into some issues.
https://stat.ethz.ch/pipermail/r-help/2008-September/172537.html
Given the results for a simple A / B test...
A B
clicked 8 60
ignored 192 1940
(i.e. a conversion rate of 4% for A and 3% for B)
... a Fisher test in R quite rightly says there's no significant difference
> fisher.test(data.frame(A=c(8,192), B=c(60,1940)))
...
p-value = 0.3933
...
But what function is available in R to tell me how much I need to increase my sample size to get to a p-value of say 0.05?
I could just increase the A values (in their proportion) until I get to it but there's got to be a better way? Perhaps pwr.2p2n.test [1] is somehow usable?
[1] http://rss.acs.unt.edu/Rdoc/library/pwr/html/pwr.2p2n.test.html
power.prop.test() should do this for you. In order to get the math to work I converted your 'ignored' data to impressions by summing up your columns.
> power.prop.test(p1=8/200, p2=60/2000, power=0.8, sig.level=0.05)
Two-sample comparison of proportions power calculation
n = 5300.739
p1 = 0.04
p2 = 0.03
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
That gives 5301, which is for each group, so your sample size needs to be 10600. Subtracting out the 2200 that have already run, you have 8400 "tests" to go.
In this case:
sig.level is the significance level, i.e. the p-value threshold you are testing against.
power is the probability of detecting an effect of the assumed size if it really exists. This is somewhat arbitrary; 80% is a common choice. Note that choosing 80% means that 20% of the time you won't find significance when you should. Increasing the power means you'll need a larger sample size to reach your desired significance level.
If you wanted to decide how much longer it will take to reach significance, divide 8400 by the number of impressions per day. That can help determine whether it's worthwhile to continue the test.
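As a quick sketch (the 500 impressions/day figure is purely hypothetical):
pwr <- power.prop.test(p1 = 8/200, p2 = 60/2000, power = 0.8, sig.level = 0.05)

n_needed <- 2 * ceiling(pwr$n)   # both groups combined
n_so_far <- 200 + 2000           # impressions already collected
(n_needed - n_so_far) / 500      # approximate days left at 500 impressions/day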
You can also use this function to determine required sample size before testing begins. There's a nice write-up describing this on the 37 Signals blog.
power.prop.test() is part of base R (the stats package), so you won't need to install or load anything. Other than that I can't say how similar this is to pwr.2p2n.test().