R: Function that finds the range of 95% of all values?

Is there a function, or an elegant way in the R language, to get the minimum range that covers, say, 95% of all values in a vector?
Any suggestions are very welcome :)

95% of the data will fall between the 2.5th and 97.5th percentiles. You can compute those values in R as follows:
x <- runif(100)
quantile(x,probs=c(.025,.975))
To get a sense of what's going on, here's a plot:
qts <- quantile(x,probs=c(.025,.975))
hist(x)
abline(v=qts[1],col="red")
abline(v=qts[2],col="red")
Note this is the exact/empirical 95% interval; there's no normality assumption.

It's not so hard to write such a function:
find_cover_region <- function(x, alpha = 0.95) {
  n <- length(x)
  x <- sort(x)
  k <- as.integer(round((1 - alpha) * n))
  i <- which.min(x[seq.int(n - k, n)] - x[seq_len(k + 1L)])
  c(x[i], x[n - k + i - 1L])
}
The function finds the shortest such interval. If several intervals have the same length, the leftmost one (the one closest to -Inf) is picked.
find_cover_region(1:100, 0.70)
# [1] 1 70
find_cover_region(rnorm(10000), 0.9973) # three sigma, approx (-3,3)
# [1] -2.859 3.160 # results may differ
You could also look at highest density regions (e.g. the hdr function in the hdrcde package). It's a more statistical way to find the shortest intervals with a given coverage probability (some kernel density estimation is involved).
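For example, a rough sketch (assuming the hdrcde package is installed and that hdr takes the coverage as a percentage via its prob argument, as in its defaults):
library(hdrcde)
x <- rnorm(10000)
hdr(x, prob = 95)$hdr   # endpoint(s) of the highest density region covering ~95%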

The emp.hpd function in the TeachingDemos package will find the pair of values in a vector that encloses a given percentage of the data (e.g. 95%) while giving the shortest range between them. If the data are roughly symmetric then this will be close to the result of using quantile, but if the data are skewed then this will give a shorter range.
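A minimal sketch of how that might look (assuming TeachingDemos is installed and emp.hpd's conf argument as documented):
library(TeachingDemos)
x <- rexp(10000)                 # skewed example data
emp.hpd(x, conf = 0.95)          # shortest interval containing ~95% of the values
quantile(x, c(0.025, 0.975))     # equal-tailed interval, wider when data are skewed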

If the values are distributed approximately like a normal distribution, you can use the standard deviation. First, calculate the mean µ and standard deviation stdev of the data; about 95% of the values will then lie in the interval (µ - 1.960 * stdev, µ + 1.960 * stdev).
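In R, a minimal sketch of that calculation (using made-up example data assumed to be roughly normal):
x <- rnorm(1000, mean = 5, sd = 2)            # example data
m <- mean(x); s <- sd(x)
c(m - 1.96 * s, m + 1.96 * s)                 # interval expected to hold ~95% of values
mean(x > m - 1.96 * s & x < m + 1.96 * s)     # empirical coverage, close to 0.95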

Related

How to sample from an upside down bell curve

I can generate numbers with a uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries? (a.k.a. an "upside down bell curve")
Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try the Beta distribution and map it to the desired interval, along the lines of:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As @Gregor-reinstateMonica noted, the Beta distribution is bounded on both ends, [0, 1], so it can easily be mapped into any bounded interval just by scaling and shifting. It has two parameters, and it is symmetric when those parameters are equal. Parameters above 1 make it a kind of bell distribution, but parameters below 1 turn it into an inverse bell, which is what you're looking for. You could play with them, put in different values instead of 0.5, and see how it goes. Setting both parameters equal to 1 makes it uniform.
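To see the effect of the two shape parameters, here is a quick sketch (the particular values are just illustrations):
par(mfrow = c(1, 3))
hist(rbeta(10000, 0.5, 0.5), main = "shapes 0.5, 0.5: inverted bell")
hist(rbeta(10000, 2, 2),     main = "shapes 2, 2: bell-shaped")
hist(rbeta(10000, 1, 1),     main = "shapes 1, 1: uniform")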
Sampling from a beta distribution is a good idea. Another way is to sample several uniform values and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
A web search for "order statistics" will turn up some resources.
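Here is a small sketch of that idea in R (n = 5 uniforms per draw is an arbitrary choice; taking the min or max at random piles the mass up at both ends):
n <- 5
m <- 10000
u <- matrix(runif(m * n, min = 10, max = 20), ncol = n)
lo <- apply(u, 1, min)
hi <- apply(u, 1, max)
x <- ifelse(runif(m) < 0.5, lo, hi)   # flip a coin between the min and the max
hist(x, breaks = 50)                  # most mass near 10 and 20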
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed data points using rnorm(), count the number of times each rounded value appears (n), and then subtract n from the maximum count (max(n)) to get inverse counts.
You can then use the inverse counts to build a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%
  mutate(new_n = max(n) - n)
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1 - exp(-(x0*x0))
x1pdf = 1 - exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01() * (x1 - x0) + x0
    # Choose a random y-coordinate
    y = RNDU01() * ymax
    # Return x if y falls within PDF
    if y < 1 - exp(-(x*x)): return x
end
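For reference, a rough R translation of the pseudocode above (my own sketch, not the answerer's code; x0 and x1 are whatever truncation bounds you choose):
sample_inverted_bell <- function(n, x0 = -2, x1 = 2) {
  ymax <- max(1 - exp(-x0^2), 1 - exp(-x1^2))
  out <- numeric(n)
  for (i in seq_len(n)) {
    repeat {
      x <- runif(1, x0, x1)          # random x-coordinate
      y <- runif(1, 0, ymax)         # random y-coordinate
      if (y < 1 - exp(-x^2)) break   # accept if y falls under the curve
    }
    out[i] <- x
  }
  out
}
hist(sample_inverted_bell(10000), breaks = 50)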

Majority observations outside confidence interval

I have a time series (called sigma.year) with a length of 100. Its histogram and qqplot show strong evidence for normality.
I am calculating a confidence interval (in R) for the sample mean as follows:
exp.sigma <- mean(sigma.year)
sd.sigma  <- sd(sigma.year)
se.sigma  <- sd.sigma / sqrt(length(sigma.year))
me.sigma  <- qt(.995, df = length(sigma.year) - 1) * se.sigma
low.sigma <- exp.sigma - me.sigma
up.sigma  <- exp.sigma + me.sigma
My problem is that 83/100 observations fall outside the confidence interval. Do you have any idea why this is so? Is it because I have time-series rather than cross-sectional data? Or am I calculating the confidence interval in a wrong way?
Thanks.
It's hard to evaluate completely without knowing all of your inputs (for example, a dput of sigma.year), but your confidence interval appears to be a confidence interval for the mean. So it is not unexpected that 83/100 observations are outside of a 99% confidence interval about the mean.
To clarify: if sd.sigma is the standard deviation of your sample, then you have correctly calculated the 99% confidence interval about the mean.
And again, your data are behaving as you'd expect for a sample of 100 observations drawn from a population with a normal distribution. Here's some code to check that:
x <- rnorm(100)
exp.x <- mean(x)
se.x <- sd(x)/sqrt(length(x))
q.x <- qt(0.995, df = length(x)-1)
interval <- c(exp.x - se.x*q.x, exp.x + se.x*q.x)
sum(x > interval[1] & x < interval[2])
# will vary, because I didn't set the seed on purpose, but try this
# you'll get a value around 20
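A quick check (my own sketch, not part of the original answer) that a value around 20 is typical rather than a fluke, by repeating the experiment many times:
coverage <- replicate(1000, {
  x <- rnorm(100)
  half <- qt(0.995, df = 99) * sd(x) / sqrt(100)
  sum(abs(x - mean(x)) < half)   # observations inside the 99% CI for the mean
})
mean(coverage)   # on average roughly 20 of the 100 observations fall inside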

Gamma equivalent to standard deviations

I have a gamma distribution fit to my data using library(fitdistrplus). I need to determine a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations from the mean could be considered to be the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data falls?
Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define confidence ranges as [mean - k*sd, mean + k*sd]. To see why, consider if we selected k=2 in the example above. This would yield the range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and 4.7% of the data falls above 1.609. This is at the very least not a well balanced confidence interval.
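You can check those numbers with pgamma (a quick sketch for the shape = 2, rate = 3 example, with k = 2):
mu <- 2 / 3; sdev <- sqrt(2) / 3
mu + c(-2, 2) * sdev                  # approx -0.276 and 1.609
pgamma(mu - 2 * sdev, 2, 3)           # 0: a gamma variable is never negative
1 - pgamma(mu + 2 * sdev, 2, 3)       # approx 0.047, i.e. ~4.7% of the mass above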
A more natural choice might be to take the 0.025 and 0.975 percentiles of the distribution as a confidence range. We would expect 2.5% of the data to fall below this range and 2.5% of the data to fall above it. We can use qgamma to determine that, for our example parameters, the confidence range would be [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446
The expected value of a gamma distribution is:
E[X] = k * theta
and the variance is Var[X] = k * theta^2, where k is the shape and theta is the scale.
But typically I would use 95% quantiles to indicate data spread.
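Tying this back to the earlier example (shape 2, rate 3), a quick sketch using the scale parameterization, since rgamma and qgamma accept scale = 1/rate:
k <- 2; theta <- 1 / 3
x <- rgamma(100000, shape = k, scale = theta)
c(mean(x), k * theta)                                # both about 0.667
c(var(x), k * theta^2)                               # both about 0.222
qgamma(c(0.025, 0.975), shape = k, scale = theta)    # the 95% quantile range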

Derivative of Kernel Density

I am using density {stats} to construct a kernel "gaussian" density estimate of a vector of values. If I use the following example dataset:
x <- rlogis(1475, location=0, scale=1) # x is a vector of values - taken from a rlogis just for the purpose of explanation
d<- density(x=x, kernel="gaussian")
Is there some way to get the first derivative of this density d at each of the n=1475 points?
Edit #2:
Following up on Greg Snow's excellent suggestion to use the analytical expression for the derivative of a Gaussian, and our conversation following his post, this will get you the exact slope at each of those points:
s <- d$bw
## The derivative of a Gaussian kernel centred at x_i, evaluated at X, is
## dnorm(X - x_i, 0, s) * (x_i - X) / s^2; average this over all 1475 kernels
slope2 <- sapply(x, function(X) mean(dnorm(x - X, mean = 0, sd = s) * (x - X) / s^2))
## And then, to compare to the method below, plot the results against one another
plot(slope2 ~ slope)
Edit:
OK, I just reread your question, and see that you wanted slopes at each of the points in the input vector x. Here's one way you might approximate that:
slope <- (diff(d$y)/diff(d$x))[findInterval(x, d$x)]
A possible further refinement would be to find the location of the point within its interval, and then calculate its slope as the weighted average of the slope of the present interval and the interval to its right or left.
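One possible reading of that refinement (a sketch of my own, not the answerer's code): blend the slope of the segment containing x with its nearer neighbouring segment, weighted by how far x sits from the segment midpoint.
slopes <- diff(d$y) / diff(d$x)                 # slope of each segment
seg <- findInterval(x, d$x)                     # which segment each x falls in
mid <- (d$x[seg] + d$x[seg + 1]) / 2            # midpoint of that segment
w <- abs(x - mid) / diff(d$x)[1]                # 0 at the midpoint, 0.5 at an edge
nb <- pmin(pmax(seg + sign(x - mid), 1), length(slopes))  # neighbour to the left or right
slope_w <- (1 - w) * slopes[seg] + w * slopes[nb]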
I'd approach this by averaging the slopes of the segments just to the right and left of each point. (A bit of special care needs to be taken for the first and last points, which have no segment to their left and right, respectively.)
dy <- diff(d$y)
dx <- diff(d$x)[1] ## Works b/c density() returns points at equal x-intervals
((c(dy, tail(dy, 1)) + c(head(dy, 1), dy))/2)/dx
The curve of a density estimator is just the sum of all the kernels, in your case a gaussian (divided by the number of points). The derivative of a sum is the sum of the derivatives and the derivative of a constant times a function is that constant times the derivative. So the derivative of the density estimate at a given point will just be the average of the slopes for the 1475 different gaussian curves at that given point. Each gaussian curve will have a mean corresponding to each of the data points and a standard deviation based on the bandwidth. So if you can calculate the slope for a gaussian, then finding the slope for the density estimate is just a mean of the 1475 slopes.

Plotting Probability Density / Mass Function of Dataset in R

I have a dataset and I want to analyse these data with a probability density function or a probability mass function in R. I used the density function but it didn't give me the probability.
My data are like this:
"step","Time","energy"
1, 22469 , 392.96E-03
2, 22547 , 394.82E-03
3, 22828,400.72E-03
4, 21765, 383.51E-03
5, 21516, 379.85E-03
6, 21453, 379.89E-03
7, 22156, 387.47E-03
8, 21844, 384.09E-03
9 , 21250, 376.14E-03
10, 21703, 380.83E-03
I want to get the PDF/PMF for the energy vector; the data we take into account are discrete in nature, so I don't have any particular distribution type in mind for the data.
Your data look far from discrete to me. Expecting a probability when working with continuous data is plain wrong. density() gives you an empirical density function, which approximates the true density function. To check that it is a correct density, we calculate the area under the curve:
energy <- rnorm(100)
dens <- density(energy)
sum(dens$y)*diff(dens$x[1:2])
# [1] 1.000952
Give or take some rounding error, the area under the curve sums to one, and hence the outcome of density() fulfills the requirements of a PDF.
Use the probability=TRUE option of hist, or the function density() (or both), e.g.:
hist(energy,probability=TRUE)
lines(density(energy),col="red")
This overlays the kernel density estimate (red line) on a probability-scaled histogram of energy.
If you really need a probability for a discrete variable, you use:
x <- sample(letters[1:4],1000,replace=TRUE)
prop.table(table(x))
# x
#     a     b     c     d
# 0.244 0.262 0.275 0.219
Edit: an illustration of why the naive count(x)/sum(count(x)) is not a solution. The values of the bins summing to one does not mean that the area under the curve does; for that, you have to multiply by the width of the bins. Take the normal distribution, for which we can calculate the PDF using dnorm(). The following code constructs a normal sample, calculates the density, and compares it with the naive solution:
x <- sort(rnorm(100,0,0.5))
h <- hist(x,plot=FALSE)
dens1 <- h$counts/sum(h$counts)
dens2 <- dnorm(x,0,0.5)
hist(x,probability=TRUE,breaks="fd",ylim=c(0,1))
lines(h$mids,dens1,col="red")
lines(x,dens2,col="darkgreen")
The resulting plot shows the naive red curve (bin proportions) sitting well below the true density in dark green, precisely because it was never divided by the bin width.
The cumulative distribution function
In case @Iterator was right, it's rather easy to construct the cumulative distribution function from the density. The CDF is the integral of the PDF. In the case of discrete values, that is simply the sum of the probabilities. For continuous values, we can use the fact that the intervals of the empirical density estimate are equally spaced, and calculate:
cdf <- cumsum(dens$y * diff(dens$x[1:2]))
cdf <- cdf / max(cdf) # to correct for the rounding errors
plot(dens$x,cdf,type="l")
This plots the empirical CDF, a monotone curve rising from 0 to 1.
