Majority of observations outside confidence interval - r

I have a time series (called sigma.year) with a length of 100. Its histogram and qqplot show strong evidence for normality.
I am calculating a confidence interval (in R) for the sample mean as follows:
exp.sigma <- mean(sigma.year)                               # sample mean
sd.sigma  <- sd(sigma.year)                                 # sample standard deviation
se.sigma  <- sd.sigma/sqrt(length(sigma.year))              # standard error of the mean
me.sigma  <- qt(.995, df = length(sigma.year)-1)*se.sigma   # margin of error, 99% level
low.sigma <- exp.sigma - me.sigma                           # lower bound
up.sigma  <- exp.sigma + me.sigma                           # upper bound
My problem is that 83/100 observations fall outside the confidence interval. Do you have any idea why this happens? Is it because I have time-series rather than cross-sectional data? Or am I calculating the confidence interval in a wrong way?
Thanks.

It's hard to evaluate completely without knowing all of your inputs (for example, a dput of sigma.year), but your confidence interval appears to be a confidence interval for the mean. So it is not unexpected that 83/100 observations are outside of a 99% confidence interval about the mean.
To clarify: if sd.sigma is the standard deviation of your sample, then you have correctly calculated the 99% confidence interval about the mean.
And again, your data are behaving as you'd expect for a sample of 100 observations drawn from a population with a normal distribution. Here's some code to check that:
x <- rnorm(100)                        # 100 draws from a standard normal
exp.x <- mean(x)
se.x <- sd(x)/sqrt(length(x))          # standard error of the mean
q.x <- qt(0.995, df = length(x)-1)     # t quantile for a 99% interval
interval <- c(exp.x - se.x*q.x, exp.x + se.x*q.x)
sum(x > interval[1] & x < interval[2]) # count of observations inside the interval
# The count will vary (I didn't set a seed on purpose), but try it:
# you'll get a value around 20.
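If instead you want an interval that should contain most individual observations (rather than the mean), a prediction-style interval widens the margin by a factor of sqrt(1 + 1/n). A minimal sketch, reusing x from above and assuming, as before, that the data are roughly normal:
pi.x <- mean(x) + c(-1, 1) * qt(0.995, df = length(x)-1) * sd(x) * sqrt(1 + 1/length(x))
sum(x > pi.x[1] & x < pi.x[2])  # in-sample check: now close to 99 of the 100 observations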

Related

99% confidence interval, proportion

Maybe a dumb question, but I've just started, so thanks for any help.
I created a 99% confidence interval for a proportion, but I'm not sure it is correct. How can I check it? (When calculating a confidence interval for a mean, we use the t-score, and we can verify the result with the t.test function and its degrees of freedom.)
Is there a similar function that does the same thing for z-scores and proportions, or can I use t.test for this as well?
There are a number of functions in R for computing confidence intervals for proportions. Personally, I like to use the binomCI function from the MKinfer package:
install.packages("MKinfer")
library(MKinfer)
x <- 50 # The number of "successes" in your data
n <- 100 # The number of observations in your data
binomCI(x, n, conf.level = 0.99, method = "wald")
Note, however, that the so-called Wald interval (the one presented in most introductory statistics texts, and probably the one you computed) is usually a poor choice. The binomCI documentation lists the other available methods.
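If you'd rather stay in base R, binom.test gives an exact (Clopper-Pearson) interval and prop.test gives a score-based (Wilson-type) interval; both take a conf.level argument. A quick sketch using the same counts as above:
binom.test(50, 100, conf.level = 0.99)$conf.int  # exact (Clopper-Pearson) interval
prop.test(50, 100, conf.level = 0.99)$conf.int   # score interval, with continuity correction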

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the data below and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me that my data follow a beta-binomial distribution: discrete values with a restricted range.
set1 <- data.frame(numbers = c(3, 3, 2.5, 2.5, 4.5, 3, 2, 4, 3, 3.5, 3.5, 2.5, 3,
                               3, 3.5, 3, 3, 4, 3.5, 3.5, 4, 3.5, 3.5, 4, 3.5))
I see that there are multiple functions that appear to be able to do this, such as betabinomial() in VGAM and rbetabinom() in emdbook, but my statistics and coding knowledge is not yet sufficient to understand and apply the instructions on the function help pages, at least not in a way that has served my purpose so far.
We can look at the distribution of your variable; the y-axis is the probability:
x1 <- set1$numbers*2  # rescale the 0-5 scores (steps of 0.5) onto the integers 0-10
h <- hist(x1, breaks = seq(0, 10))
bp <- barplot(h$counts/length(x1), names.arg = (h$mids+0.5)/2, ylim = c(0, 0.35))
You can try to fit it, but you have too few data points to estimate the three parameters needed for a beta-binomial. Hence I fix the probability so that the mean matches the mean of your scores; judging from the distribution above, that seems reasonable:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob, size, theta) {
  -sum(dbetabinom(x1, prob, size, theta, log = TRUE))  # negative log-likelihood
}
m0 <- mle2(mtmp, start = list(theta = 100),
           data = list(size = 10, prob = mean(x1)/10),
           control = list(maxit = 1000))
THETA <- coef(m0)[1]  # fitted overdispersion parameter
We can also use a normal distribution:
normal_fit <- fitdistr(x1, "normal")
MEAN <- normal_fit$estimate[1]
SD <- normal_fit$estimate[2]
Plot both of them:
lines(bp[,1], dbetabinom(1:10, size = 10, prob = mean(x1)/10, theta = THETA),
      col = "blue", lwd = 2)
lines(bp[,1], dnorm(1:10, MEAN, SD), col = "orange", lwd = 2)
legend("topleft", c("normal", "betabinomial"), fill = c("orange", "blue"))
I think you are actually fine using a normal approximation, in which case the estimates are:
normal_fit$estimate
#     mean        sd
# 6.560000 1.134196
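If you do want the 95% interval from the beta-binomial fit itself, one simple route is to simulate from the fitted distribution with emdbook's rbetabinom and take empirical quantiles, mapping the result back to the original 0-5 scale. A minimal sketch reusing x1 and THETA from above:
sims <- rbetabinom(100000, prob = mean(x1)/10, size = 10, theta = THETA)  # draws from the fit
quantile(sims, probs = c(0.025, 0.975)) / 2  # central 95% range, back on the 0-5 scoring scale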

R: boot.ci - does type="basic" provide the correct confidence interval end points?

In short, my question is whether boot.ci returns the correct interval endpoints for type="basic".
I am computing confidence intervals based on the boot package, using boot and boot.ci. I noticed that some confidence intervals looked strange when I used the "basic" type for the boot.ci function. In contrast, "bca" or "perc" produced what I expected. My first guess would be that "basic" mixes something up when subtracting the lower/upper end points.
But I might be wrong, e.g., I might be missing some crucial difference between the types "basic" and "bca" that explains this behavior.
See, e.g., the following code example. For the purpose of illustration I create some highly skewed random data and try to compute confidence intervals for the median. What I would expect is a confidence interval that is positive (because the data are). What I get (see the first plot, based on type="basic") is a lower end point that is strongly negative, and the interval is skewed in the wrong direction. The second plot (based on type="bca") shows pretty much what I expect to happen if everything works correctly.
require(boot)
set.seed(1)
x <- 10^runif(100,0,5)-1 #sample data
medw <- function(x, i) median(x[i]) #statistic for bootstrap: the median
resb <- boot(x,medw,R=1000) #bootstrap
ci <- boot.ci(resb,0.95,type = "basic") #confidence
require(plotrix) #for plotting
par(mfrow=c(1,2))
plotCI(ci$t0,li=ci$basic[4],ui=ci$basic[5]) #confidence plot
boxplot(x) #boxplot, just for some visual context
ci <- boot.ci(resb,0.95,type = "bca") #confidence
plotCI(ci$t0,li=ci$bca[4],ui=ci$bca[5]) #confidence plot BCA
boxplot(x) #boxplot, just for some visual context
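For what it's worth, the negative lower endpoint is how the "basic" interval is constructed rather than a mix-up: it reflects the bootstrap quantiles around 2*t0, i.e. (2*t0 - upper quantile, 2*t0 - lower quantile), so a strongly right-skewed bootstrap distribution pushes the lower endpoint down, possibly below zero even for positive data. A rough check of that construction (boot.ci interpolates quantiles slightly differently, so this only approximately matches):
qs <- quantile(resb$t, probs = c(0.975, 0.025))  # upper, then lower bootstrap quantile
2*resb$t0 - qs                                   # approximately the "basic" endpoints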

Gamma equivalent to standard deviations

I have a gamma distribution fitted to my data using library(fitdistrplus). I need to determine a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations from the mean could be considered to be the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data falls?
Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define confidence ranges as [mean - k*sd, mean + k*sd]. To see why, consider what happens if we select k=2 in the example above. This would yield the confidence range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and 4.7% of the data falls above 1.609. This is at the very least not a well balanced confidence interval.
A more natural choice might be to take the 0.025 and 0.975 quantiles of the distribution as a confidence range. We would expect 2.5% of the data to fall below this range and 2.5% to fall above it. We can use qgamma to determine that for our example parameters the confidence range would be [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446
The expected value of a gamma distribution is E[X] = k * theta, and the variance is Var[X] = k * theta^2, where k is the shape and theta is the scale parameter.
But typically I would use the 95% quantile range to indicate data spread.
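Tying this back to the fitdistrplus fit mentioned in the question: the estimated shape and rate can be plugged straight into qgamma. A minimal sketch (the rgamma call is just a placeholder for your data; fitdist stores the parameters under the names "shape" and "rate"):
library(fitdistrplus)
x <- rgamma(1000, shape = 2, rate = 3)  # placeholder for your data vector
fit <- fitdist(x, "gamma")              # maximum-likelihood fit of shape and rate
qgamma(c(0.025, 0.975), shape = fit$estimate["shape"], rate = fit$estimate["rate"])  # central 95% range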

R: Function that finds the range of 95% of all values?

Is there a function, or an elegant way in the R language, to get the minimum range that covers, say, 95% of all values in a vector?
Any suggestions are very welcome :)
95% of the data will fall between the 2.5th percentile and 97.5th percentile. You can compute that value in R as follows:
x <- runif(100)
quantile(x,probs=c(.025,.975))
To get a sense of what's going on, here's a plot:
qts <- quantile(x, probs = c(.025, .975))  # the same 95% bounds as above
hist(x)
abline(v=qts[1],col="red")
abline(v=qts[2],col="red")
Note this is the exact/empirical 95% interval; there's no normality assumption.
It's not so hard to write such a function:
find_cover_region <- function(x, alpha = 0.95) {
  n <- length(x)
  x <- sort(x)
  k <- as.integer(round((1-alpha) * n))  # number of observations to leave out
  # slide a window of n-k consecutive order statistics and pick the narrowest
  i <- which.min(x[seq.int(n-k, n)] - x[seq_len(k+1L)])
  c(x[i], x[n-k+i-1L])
}
The function finds the shortest such interval. If several intervals have the same length, the first one (starting from -Inf) is picked.
find_cover_region(1:100, 0.70)
# [1] 1 70
find_cover_region(rnorm(10000), 0.9973) # three sigma, approx (-3,3)
# [1] -2.859 3.160 # results may differ
You could also look at highest density regions (e.g. the hdr function in the hdrcde package). It's a more statistical way to find the shortest intervals with a given coverage probability (kernel density estimators are involved).
The emp.hpd function in the TeachingDemos package will find the values in a vector that enclose a given percentage of the data (95%) while also giving the shortest range between the values. If the data are roughly symmetric, this will be close to the result of using quantile; if the data are skewed, it will give a shorter range.
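A minimal sketch of emp.hpd next to the plain quantile approach, using skewed example data (the rexp draw is only for illustration):
library(TeachingDemos)
x <- rexp(1000)                       # skewed example data
emp.hpd(x, conf = 0.95)               # shortest interval holding ~95% of the data
quantile(x, probs = c(0.025, 0.975))  # equal-tailed interval, wider here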
If the values are distributed approximately like the normal distribution, you can use the standard deviation: calculate the mean µ and standard deviation σ of the values; about 95% of them will lie in the interval (µ - 1.960σ, µ + 1.960σ).
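A one-line version of that rule in R (the rnorm call is only example data; the coverage claim holds to the extent the data are really normal):
x <- rnorm(100, mean = 10, sd = 2)   # example data, roughly normal
mean(x) + c(-1, 1) * 1.960 * sd(x)   # approximate 95% range under normality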
