boxplot.stats in R not identifying outliers

I have used boxplot.stats()$out to get the outliers of a vector in R. However, I have noticed that it often fails to identify outliers. For example:
list = c(3,4,7,500)
boxplot.stats(list)
$`stats`
[1] 3.0 3.5 5.5 253.5 500.0
$n
[1] 4
$conf
[1] -192 203
$out
numeric(0)
quantile(list)
0% 25% 50% 75% 100%
3.00 3.75 5.50 130.25 500.00
130.25+1.5*IQR(list) = 320
As you can see, boxplot.stats() failed to flag 500 as an outlier, even though the documentation says it uses the Q1 - 1.5*IQR / Q3 + 1.5*IQR rule. Since 500 > 320, it should have been identified as an outlier, so why isn't it found?
I have tried this with a vector of 5 elements instead of 4, and with an outlier that is very small instead of very large, and I still get the same problem.

Notice that the fourth number in the "stats" portion (the upper hinge) is 253.5, not 130.25.
The documentation for boxplot.stats says:
The two ‘hinges’ are versions of the first and third quartile, i.e.,
close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd
n (where n <- length(x)) and differ for even n. Whereas the quartiles
only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do
so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle
of two observations otherwise
In other words, for your data, it is using (500+7)/2 as the Q3 value
(and incidentally (3+4)/2 = 3.5 as Q1, not the 3.75 that you got from
quantile). Boxplot will use the boundary 253.5 + 1.5*(253.5 - 3.5) = 628.5
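The hinge-based summary is what fivenum() computes, and a quick check with the question's data shows it matches boxplot.stats()$stats:

```r
x <- c(3, 4, 7, 500)

# fivenum(): min, lower hinge, median, upper hinge, max
fivenum(x)
#> [1]   3.0   3.5   5.5 253.5 500.0

# Upper fence built from the hinges, not from quantile()'s Q3
253.5 + 1.5 * (253.5 - 3.5)
#> [1] 628.5
```

Since 500 < 628.5, no point lies beyond the fence and $out is empty.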

If you read the help page help("boxplot.stats") carefully, the return value section says the following. My emphasis.
stats
a vector of length 5, containing the extreme of the lower
whisker, the lower ‘hinge’, the median, the upper ‘hinge’ and
the extreme of the upper whisker.
Then, in the same section, again my emphasis.
out
the values of any data points which lie beyond the
extremes of the whiskers (if(do.out)).
Your data has 4 points. The extreme of the upper whisker, as returned in list member $stats, is 500.0, and this is the maximum of your data. There is no error.

To identify all the outliers, try this (the formula assumes the built-in iris data):
library(car)
Boxplot(Petal.Length ~ Species, data = iris, id = list(n = Inf))

Related

R How to sample from an interrupted upside down bell curve

I've asked a related question before, which successfully received an answer. Now I want to sample values from an upside-down bell curve, but exclude a range of values that falls in the middle of it, as shown in the picture below:
I have this code currently working:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
How may I adapt it to achieve the desired output?
Say you want a sample of 10,000 from your distribution but don't want any numbers between 5 and 15 in your sample. Why not just do:
q <- min + (max-min)*rbeta(50000, 0.5, 0.5);
q <- q[!(q > 5 & q < 15)][1:10000]
Which gives you this:
hist(q)
But still has the correct size:
length(q)
#> [1] 10000
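Note that the fixed 5x oversampling factor above is a guess; if the excluded band held most of the probability mass you could end up with fewer than 10,000 values. A sketch that keeps drawing until the target size is reached (the function and argument names are mine):

```r
sample_outside_band <- function(n, min = 1, max = 20, lo = 5, hi = 15) {
  out <- numeric(0)
  while (length(out) < n) {
    q <- min + (max - min) * rbeta(n, 0.5, 0.5)
    out <- c(out, q[!(q > lo & q < hi)])  # keep values outside (lo, hi)
  }
  out[1:n]
}

q <- sample_outside_band(10000)
length(q)
#> [1] 10000
```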
An "upside-down bell curve" compared to the normal distribution, with the exclusion of a certain interval, can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. I adapted it from another answer I just posted.
Notice that this sampler samples in a truncated interval (here, the interval [x0, x1], with the exclusion of [x2, x3]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1 - exp(-(x0*x0))
x1pdf = 1 - exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01()*(x1-x0) + x0
    # Choose a random y-coordinate
    y = RNDU01()*ymax
    # Return x if (x, y) falls under the PDF
    if (x < x2 or x > x3) and y < 1 - exp(-(x*x)): return x
end
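The pseudocode translates fairly directly to R (runif() stands in for RNDU01(); the variable names follow the pseudocode):

```r
# Rejection sampler for the density proportional to 1 - exp(-x^2),
# truncated to [x0, x1] with the band (x2, x3) excluded
sample_inverted_bell <- function(x0, x1, x2, x3) {
  ymax <- max(1 - exp(-x0^2), 1 - exp(-x1^2))
  repeat {
    x <- runif(1, x0, x1)    # random x-coordinate
    y <- runif(1, 0, ymax)   # random y-coordinate
    if ((x < x2 || x > x3) && y < 1 - exp(-x^2)) return(x)  # accept
  }
}

xs <- replicate(1000, sample_inverted_bell(-3, 3, -1, 1))
```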

Print significant auto-correlation value

If I do an autocorrelation test in R (acf), I get a great graph, and the horizontal lines show the cutoff of significance.
acf also prints out the individual lag values in the console, however, here I can't see which are significant. Is there an easy way to do that without looking at the graph?
So basically, for this we need to know the cutoff value. By inspecting the source of acf and stats:::plot.acf you can see that it may differ for different parameter values, but for the default values, here is what you should use:
set.seed(123)
x <- arima.sim(list(ar = 0.5), 100)
r <- acf(x, plot = FALSE)$acf
which(abs(r)[-1] >= qnorm(1 - 0.05 / 2) / sqrt(length(x)))
# [1] 1 2 3 9 10 12 13
where 0.05 is the significance level in this case.
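If you also want the autocorrelation values alongside the lags, a small helper built on the same cutoff (the function name is mine):

```r
# Return the significant lags together with their autocorrelations
significant_acf <- function(x, level = 0.05) {
  r <- drop(acf(x, plot = FALSE)$acf)[-1]            # drop lag 0
  cutoff <- qnorm(1 - level / 2) / sqrt(length(x))   # default acf cutoff
  keep <- abs(r) >= cutoff
  data.frame(lag = which(keep), acf = r[keep])
}

set.seed(123)
x <- arima.sim(list(ar = 0.5), 100)
significant_acf(x)
```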

R function to calculate area under the normal curve between adjacent standard deviations

I'm looking into GoF (goodness of fit) testing, and wanted to see if the quantiles of a vector of data followed the expected frequency of a normal distribution N(0, 1), and before running the chi square test, I generated these frequencies for the normal distribution:
< -2 SD's (standard deviations), between -2 and -1 SD's, between -1 and 0 SD's, between 0 and 1 SD's, between 1 and 2 SD's, and more than 2 SD's.
To do so I took the long route:
(Normal_distr <- c(pnorm(-2), pnorm(-1) - pnorm(-2), pnorm(0) - pnorm(-1),
pnorm(1) - pnorm(0), pnorm(2) - pnorm(1), pnorm(2, lower.tail = F)))
[1] 0.02275013 0.13590512 0.34134475 0.34134475 0.13590512 0.02275013
I see that the symmetry allows me to cut down the length of the code, but isn't there an easier way? Something like (I don't think this will work, but along the lines of) pnorm(-2:-1) returning a value identical to pnorm(-1) - pnorm(-2) = 0.13590512?
Question: Is there an R function that calculates the area under the normal curve between quantiles so that we can pass a vector such as c(-3:3) through it, as opposed to subtracting pnorm()'s of adjacent standard deviations or other quantiles?
I'm not sure if there is a specific function to do this, but you can do it pretty simply like so:
#Get difference between adjacent quantiles
diff(pnorm(-2:-1))
[1] 0.1359051
#Get area under normal curve from -2 to 2 sd's:
sum(diff(pnorm(-2:2)))
[1] 0.9544997
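And the vectorised call the question asks about works directly, passing the whole c(-3:3) at once:

```r
# Area under N(0, 1) between each adjacent pair of SDs from -3 to 3
diff(pnorm(-3:3))
#> [1] 0.02140023 0.13590512 0.34134475 0.34134475 0.13590512 0.02140023
```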

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop and algorithm and generate a samples for a given geometric distribution with PMF
Using the inverse transform method, I came up with the following expression for generating the values:
Where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R and I already generated QQ Plots to visually assess the adjustment of the empirical values to the theoretical ones (generated with R), i.e., if the generated sample follows indeed the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x = 0, not x = 1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it starts from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly; you have to do it the first way. You estimate the parameter from the data (by maximum likelihood or minimum chi-square), then test as above, but with one fewer degree of freedom because of the estimated parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
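For concreteness, here is a sketch of the unspecified-p case (my code, reusing the binning above; for the zero-based geometric the maximum-likelihood estimate is p = 1 / (1 + mean(y))):

```r
set.seed(1)
y <- rgeom(1000, 0.3)

p_hat <- 1 / (1 + mean(y))                 # ML estimate of p

# Same binning as before: cells 0..14 plus a "15+" tail cell
obs  <- table(factor(pmin(y, 15), levels = 0:15))
prob <- dgeom(0:14, p_hat)
prob <- c(prob, 1 - sum(prob))             # tail probability for "15+"

chisqstat <- sum((obs - 1000 * prob)^2 / (1000 * prob))
# 16 cells, minus 1, minus 1 more for the estimated parameter
pval <- pchisq(chisqstat, df = 14, lower.tail = FALSE)
```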
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a goodfit function, described as "Goodness-of-fit Tests for Discrete Data", in package vcd:
library(vcd)
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of questions with upvoted answers, so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question ---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which I want to test its distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--- reposted answer ---
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the functions which essentially compares their inverse ECDF's at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100),prob=0.3), sim.res,
main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points of each distribution. (I added a jittering feature to get a better idea of where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500), prob = 0.3)), jitter(sim.res),
       main = expression("Q-Q plot for" ~~ {G}[n == 500]),
       ylim = c(0, max(qgeom(ppoints(500), prob = 0.3), sim.res)),
       xlim = c(0, max(qgeom(ppoints(500), prob = 0.3), sim.res)))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
       prob = c(0.25, 0.75), col = "red")

How to get the right simulation in R

Question:
Suppose the numbers in the following random number table correspond to people arriving for work at a large factory. Let 0, 1, and 2 be smokers and 3-9 be nonsmokers. After many arrivals, calculate the total relative frequency of smokers.
here is my R code to simulate the total relative frequency of smokers.
simulation <- function(k) {
  x <- round(runif(k) * 10)
  return(length(x[x < 3]) / k)
}
> simulation(100)
[1] 0.27
> simulation(1000)
[1] 0.244
> simulation(10000)
[1] 0.2445
> simulation(100000)
[1] 0.24923
Why can't I get the result 0.3?
If all you want to do is get a discrete uniform distribution on the numbers 0, 1, ..., 9 then just use sample
sample(0:9, k, replace = TRUE)
With the code you have right now, you'll actually get a probability of .05 each of getting 0 or 10 and a probability of .10 each of getting 1-9, so P(x < 3) = .05 + .10 + .10 = .25, which matches the values your simulation converges to.
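A corrected version of the question's function using sample() (floor(runif(k) * 10) would also work):

```r
simulation <- function(k) {
  x <- sample(0:9, k, replace = TRUE)  # each digit 0..9 equally likely
  length(x[x < 3]) / k                 # relative frequency of smokers
}

set.seed(1)
simulation(100000)   # now close to 0.3
```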
