3-dimensional prediction limit in R

Consider a data set with 3 variables (here normal and independent, but they could be correlated):
data = data.frame(x1 = rnorm(10000),
                  x2 = rnorm(10000),
                  x3 = rnorm(10000))
I would like to get the narrowest range for x1, x2 and x3 such that 95% of the observations fall within all three ranges.
So far I have the code below.
is.between <- function(x, a, b){
x <= max(c(a, b)) & x >= min(c(a, b))
}
getlims <- function(lims, x1, x2, x3){
abs(mean(
is.between(x1, lims[1], lims[2]) &
is.between(x2, lims[3], lims[4]) &
is.between(x3, lims[5], lims[6])
) - 0.95)
}
optim(initial_values, getlims, x1=x1,x2=x2,x3=x3)
Here lims[1:2] is the range for x1, lims[3:4] is the range for x2, and lims[5:6] is the range for x3.
It provides limits that contain 95% of my observations, but it doesn't guarantee that the resulting box has the smallest possible volume, i.e. the smallest product of the three range widths.

I think this is actually a discrete optimization problem. The question was posed in three dimensions; I have reformulated it in two dimensions for easier visualization, and the approach extends immediately to more dimensions.
Let us try to solve it as a nonlinear optimization problem with constraints.
set.seed(1009)
N <- 1000
x <- rnorm(N); y <- rnorm(N)
The 0.05 and 0.95 quantiles of these coordinates will also be needed.
q1 <- quantile(x, 0.05); q2 <- quantile(x, 0.95)
q3 <- quantile(y, 0.05); q4 <- quantile(y, 0.95)
We define two functions, fmin the function that is to be minimized, and fbnd the function that defines the constraints. That is, we require fbnd(x) >= 0, in this way expressing that at least 95% of the points lie inside the rectangle.
fmin <- function(p) (p[2]-p[1]) * (p[4]-p[3])
fbnd <- function(p) {
c(0.05 - sum(x < p[1] | x > p[2] | y < p[3] | y > p[4]) / N,
q1 - p[1], p[2] - q2,
q3 - p[3], p[4] - q4 )
}
As a starting point we can take the ranges of the x- and y-coordinates.
start <- c(range(x), range(y))
The optimization solver has to minimize a function with nonlinear constraints. The auglag routine in package nloptr is a candidate solver.
S <- nloptr::auglag(start, fn=fmin, hin=fbnd)
S$par; S$value
# [1] -2.301263 2.308038 -2.079166 2.130744
# [1] 19.40474
We can improve the solution by moving each rectangle boundary inward to the nearest x- or y-coordinate of a data point. This step is necessary because the objective function is locally constant.
r <- S$par
r[1] <- min(x[x >= r[1]]); r[2] <- max(x[x <= r[2]])
r[3] <- min(y[y >= r[3]]); r[4] <- max(y[y <= r[4]])
r
# [1] -2.299467 2.281395 -2.079166 2.127260
We can verify that 50 points lie outside the rectangle and that the area is now 19.26905.
(r[2]-r[1]) * (r[4]-r[3]) # 19.26905
sum(x < r[1] | x > r[2] | y < r[3] | y > r[4]) # 50
The solution could still be a local minimum. Fortunately, the objective function is also locally monotone, so this will normally not happen. Of course, one could verify the solution by applying a global solver.
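If in doubt, one rough check (a sketch of my own, reusing fmin, fbnd, start and S from above; the 20 restarts are an arbitrary choice) is to rerun the same auglag call from several jittered starting rectangles and keep the best feasible result:
best <- S
for (i in 1:20) {
  s0 <- start + c(-1, 1, -1, 1) * runif(4, 0, 1)   # slightly enlarged, jittered starting rectangle
  Si <- nloptr::auglag(s0, fn = fmin, hin = fbnd)
  if (all(fbnd(Si$par) > -1e-6) && Si$value < best$value) best <- Si   # keep the best feasible area
}
best$par; best$value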

Related

Vectorized implementation of exponentially weighted moving standard deviation using R?

I am trying to implement a vectorized exponentially weighted moving standard deviation using R. Is this the correct approach?
ewma <- function (x, alpha) {
c(stats::filter(x * alpha, 1 - alpha, "recursive", init = x[1]))
}
ewmsd <- function(x, alpha) {
sqerror <- na.omit((x - lag(ewma(x, alpha)))^2)
ewmvar <- c(stats::filter(sqerror * alpha, 1 - alpha, "recursive", init = 0))
c(NA, sqrt(ewmvar))
}
I'm guessing it's not, since its output is different from Python's pandas.Series.ewm.std() function.
When I run
ewmsd(x = 0:9, alpha = 0.96)
the output is
[1] NA 0.2236068 0.4874679 0.7953500 1.1353903 1.4993855 1.8812961 2.2764708 2.6812160 3.0925367
However, with
pd.Series(range(10)).ewm(alpha = 0.96).std()
the output is
0 NaN
1 0.707107
2 0.746729
3 0.750825
4 0.751135
5 0.751155
6 0.751156
7 0.751157
8 0.751157
9 0.751157
According to the documentation for pandas, the pandas.Series.ewm() function takes an adjust parameter, which defaults to True. When adjust == True, the exponentially weighted moving average from pandas.Series.ewm.mean() is calculated through weights, not recursively. Naturally, this affects the standard deviation output as well. See this GitHub issue and this question for more info.
Here's a vectorized solution in R:
ewmsd <- function(x, alpha) {
  n <- length(x)
  sapply(
    1:n,
    function(i, x, alpha) {
      y <- x[1:i]                             # observations up to and including time i
      m <- length(y)
      weights <- (1 - alpha)^((m - 1):0)      # adjust = True weights: newest observation gets weight 1
      ewma <- sum(weights * y) / sum(weights) # weighted mean
      bias <- sum(weights)^2 / (sum(weights)^2 - sum(weights^2))   # bias correction factor
      sqrt(bias * sum(weights * (y - ewma)^2) / sum(weights))      # weighted, bias-corrected sd
    },
    x = x,
    alpha = alpha
  )
}
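As a quick sanity check (not part of the original answer), the call from the question should now reproduce the pandas output quoted above:
ewmsd(x = 0:9, alpha = 0.96)
# the first element is NaN (a single observation has no spread); the rest should
# match pd.Series(range(10)).ewm(alpha = 0.96).std(), i.e. roughly
# 0.707107, 0.746729, 0.750825, ..., 0.751157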

R - finding roots for a cartesian product of function parameters

Given a function f(x,c,d) of x that also depends on parameters c and d, I would like to find its zeroes for a Cartesian product of certain values c_1,...,c_n and d_1,...,d_m of the parameters, i.e. an x_ij such that f(x_ij,c_i,d_j)=0 for i=1,...,n and j=1,...,m. Although it is not that crucial, I am applying a Newton-Raphson algorithm for the root finding:
newton.raphson <- function(f, a, b, tol = 1e-5, n = 1000){
require(numDeriv) # Package for computing f'(x)
x0 <- a # Set start value to supplied lower bound
k <- n # Initialize for iteration results
# Check the upper and lower bounds to see if approximations result in 0
fa <- f(a)
if (fa == 0.0){
return(a)
}
fb <- f(b)
if (fb == 0.0) {
return(b)
}
for (i in 1:n) {
dx <- genD(func = f, x = x0)$D[1] # First-order derivative f'(x0)
x1 <- x0 - (f(x0) / dx) # Calculate next value x1
k[i] <- x1 # Store x1
# Once the difference between x0 and x1 becomes sufficiently small, output the results.
if (abs(x1 - x0) < tol) {
root.approx <- tail(k, n=1)
res <- list('root approximation' = root.approx, 'iterations' = k)
return(res)
}
# If Newton-Raphson has not yet reached convergence set x1 as x0 and continue
x0 <- x1
}
print('Too many iterations in method')
}
The actual function that I am interested in is more complicated, but the following example illustrates my problem.
test.function <- function(x=1,c=1,d=1){
return(c*d-x)
}
Then for any given c_i and d_j I can easily calculate the zero by
newton.raphson(function(x) test.function(x,c=c_i,d=d_j),0,1)[1]
which here is obviously just the product c_i*d_j.
Now I tried to define a function that finds for two given vectors (c_1,...,c_n) and (d_1,...,d_m) the zeroes for all combinations. For this, I tried to define
zeroes <- function(ci=1,dj=1){
x<-newton.raphson(function(x) test.function(x,c=ci,d=dj),0,1)[1]
return(as.numeric(x))
}
and then use the outer-function, e.g.
outer(c(1,2),c(1,2,3),FUN=zeroes)
Unfortunately, this did not work. I got an error message
Error during wrapup: dims [product 6] do not match the length of object [1]
There might also be a much better solution to my problem. I am happy for any input.
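One possible explanation (my reading, not from the question): outer() calls FUN once with two full vectors of recycled arguments and expects a result of the same length, while zeroes() returns a single number, hence the dimension error. A minimal sketch of a fix is to wrap it with Vectorize() so it is applied element-wise:
zeroes.vec <- Vectorize(zeroes)              # applies zeroes() to each (ci, dj) pair via mapply
outer(c(1, 2), c(1, 2, 3), FUN = zeroes.vec)
# should give a 2 x 3 matrix of roots, here simply the products c_i * d_j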

For a polynomial, get all its extrema and plot it by highlighting all monotonic pieces

Someone asked me this interesting question and I think it worthwhile posting it here, as there has not been any relevant thread on Stack Overflow.
Suppose I have polynomial coefficients in a length-n vector pc, where a polynomial of degree n - 1 for variable x can be expressed in its raw form:
pc[1] + pc[2] * x + pc[3] * x ^ 2 + ... + pc[n] * x ^ (n - 1)
The R core function polyroot can find all roots of this polynomial in the complex domain. But often we are also interested in the extrema: for a univariate function, local minima and maxima alternate, breaking the function into monotonic pieces.
My questions are:
How to obtain all extrema (actually all saddle points) in the real domain of a polynomial?
How to sketch this polynomial with a 2-colour scheme: red for ascending pieces and green for descending pieces?
It would be good to write this up as a function so that we can easily explore / visualize a polynomial.
As an example, consider a polynomial of degree 5:
pc <- c(1, -2.2, -13.4, -5.1, 1.9, 0.52)
obtain all saddle points of a polynomial
In fact, saddle points can be found by using polyroot on the 1st derivative of the polynomial. Here is a function doing it.
SaddlePoly <- function (pc) {
## a polynomial needs to be at least quadratic to have saddle points
if (length(pc) < 3L) {
message("A polynomial needs to be at least quadratic to have saddle points!")
return(numeric(0))
}
## polynomial coefficient of the 1st derivative
pc1 <- pc[-1] * seq_len(length(pc) - 1)
## roots in complex domain
croots <- polyroot(pc1)
## retain roots in real domain
## be careful when testing 0 for floating point numbers
rroots <- Re(croots)[abs(Im(croots)) < 1e-14]
## note that `polyroot` returns repeated roots with their multiplicities
## return unique real roots (in ascending order)
sort(unique(rroots))
}
xs <- SaddlePoly(pc)
#[1] -3.77435640 -1.20748286 -0.08654384 2.14530617
evaluate a polynomial
We need to be able to evaluate a polynomial in order to plot it. This answer of mine defines a function g that can evaluate a polynomial and its derivatives of arbitrary order. Here I copy that function in and rename it to PolyVal.
PolyVal <- function (x, pc, nderiv = 0L) {
## check missing arguments
if (missing(x) || missing(pc)) stop ("arguments missing with no default!")
## polynomial order p
p <- length(pc) - 1L
## number of derivatives
n <- nderiv
## earlier return?
if (n > p) return(rep.int(0, length(x)))
## polynomial basis from degree 0 to degree `(p - n)`
X <- outer(x, 0:(p - n), FUN = "^")
## initial coefficients
## the additional `+ 1L` is because R vector starts from index 1 not 0
beta <- pc[n:p + 1L]
## factorial multiplier
beta <- beta * factorial(n:p) / factorial(0:(p - n))
## matrix vector multiplication
base::c(X %*% beta)
}
For example, we can evaluate the polynomial at all its saddle points:
PolyVal(xs, pc)
#[1] 79.912753 -4.197986 1.093443 -51.871351
sketch a polynomial with a 2-colour scheme for monotonic pieces
Here is a function to view / explore a polynomial.
ViewPoly <- function (pc, extend = 0.1) {
## get saddle points
xs <- SaddlePoly(pc)
## number of saddle points (if 0 the whole polynomial is monotonic)
n_saddles <- length(xs)
if (n_saddles == 0L) {
message("the polynomial is monotonic; program exits!")
return(NULL)
}
## set a reasonable xlim to include all saddle points
if (n_saddles == 1L) xlim <- c(xs - 1, xs + 1)
else xlim <- extendrange(xs, range(xs), extend)
x <- c(xlim[1], xs, xlim[2])
## number of monotonic pieces
k <- length(xs) + 1L
## monotonicity (positive for ascending and negative for descending)
y <- PolyVal(x, pc)
mono <- diff(y)
ylim <- range(y)
## colour setting (red for ascending and green for descending)
colour <- rep.int(3, k)
colour[mono > 0] <- 2
## loop through pieces and plot the polynomial
plot(x, y, type = "n", xlim = xlim, ylim = ylim)
i <- 1L
while (i <= k) {
## an evaluation grid between x[i] and x[i + 1]
xg <- seq.int(x[i], x[i + 1L], length.out = 20)
yg <- PolyVal(xg, pc)
lines(xg, yg, col = colour[i])
i <- i + 1L
}
## add saddle points
points(xs, y[2:k], pch = 19)
## return (x, y)
list(x = x, y = y)
}
We can visualize the example polynomial in the question by:
ViewPoly(pc)
#$x
#[1] -4.07033952 -3.77435640 -1.20748286 -0.08654384 2.14530617 2.44128930
#
#$y
#[1] 72.424185 79.912753 -4.197986 1.093443 -51.871351 -45.856876
Here is an alternative solution, re-implementing SaddlePoly and PolyVal with the R package polynom.
library(polynom)
SaddlePoly <- function (pc) {
## a polynomial needs to be at least quadratic to have saddle points
if (length(pc) < 3L) {
message("A polynomial needs to be at least quadratic to have saddle points!")
return(numeric(0))
}
## polynomial coefficient of the 1st derivative
## pc1 <- pc[-1] * seq_len(length(pc) - 1) ## <- removed
## roots in complex domain
croots <- solve(deriv(polynomial(pc))) ## <- use package "polynom"
## retain roots in real domain
## be careful when testing 0 for floating point numbers
rroots <- Re(croots)[abs(Im(croots)) < 1e-14]
## note that `polyroot` returns repeated roots with their multiplicities
## return unique real roots (in ascending order)
sort(unique(rroots))
}
xs <- SaddlePoly(pc)
#[1] -3.77435640 -1.20748286 -0.08654384 2.14530617
## a complete re-implementation using package "polynom"
PolyVal <- function (x, pc, nderiv = 0L) {
## check missing arguments
if (missing(x) || missing(pc)) stop ("arguments missing with no default!")
## create "polynomial" object
p <- polynomial(pc)
## take derivatives
i <- 0
while (i < nderiv) {
p <- deriv(p)
i <- i + 1
}
## evaluate "polynomial" with "predict"
predict(p, x)
}
PolyVal(xs, pc)
#[1] 79.912753 -4.197986 1.093443 -51.871351
## use `ViewPoly` as it is
ViewPoly(pc)
#$x
#[1] -4.07033952 -3.77435640 -1.20748286 -0.08654384 2.14530617 2.44128930
#
#$y
#[1] 72.424185 79.912753 -4.197986 1.093443 -51.871351 -45.856876
In my opinion, the polynom package makes construction of a polynomial easy. The poly.calc function allows a polynomial to be constructed from its roots or from Lagrange interpolation.
## (x - 1) ^ 3
p1 <- poly.calc(rep(1, 3))
## x * (x - 1) * (x - 2) * (x - 3)
p2 <- poly.calc(0:3)
## Lagrange interpolation through 0:4 and rnorm(5, 0:4, 1)
set.seed(0); x <- 0:4; y <- rnorm(5, 0:4, 1)
p3 <- poly.calc(x, y)
To view those polynomials, we can use the plot.polynomial method from polynom or our ViewPoly. However, the two functions use different logic for choosing xlim for the plot.
par(mfrow = c(3, 2), mar = c(4, 4, 1, 1))
## plot `p1`
plot(p1)
ViewPoly(unclass(p1))
## plot `p2`
plot(p2)
ViewPoly(unclass(p2))
## plot `p3`
plot(p3)
ViewPoly(unclass(p3))

Trouble estimating E[u(X)] by simulation when u() is exponential

I'm trying to use R to estimate E[u(X)] where u is a utility function and X is a random variable. More specifically, I want to be able to rank E[u(X)] and E[u(Y)] for two random variables X and Y -- only the ranking matters.
My problem is that u(x) = -exp(-sigma * x) for some sigma > 0, and this converges very rapidly to zero. So I have many cases where I expect, say, E[u(X)] > E[u(Y)], but because they are so close to zero, my simulation cannot distinguish them.
Does anyone have any advice for me?
I am only interested in ranking the two expected utilities, so u(x) can be replaced by any u.tilde(x) = a * u(x) + b, where a > 0 and b can be any number.
Below is an example where X and Y are both normal (in which case I think there is a closed form solution, but pretend X and Y have complicated distributions that I can only simulate from).
get.u <- function(sigma=1) {
stopifnot(sigma > 0)
utility <- function(x) {
return(-exp(-sigma * x))
}
return(utility)
}
u <- get.u(sigma=1)
curve(u, from=0, to=10) # Converges very rapidly to zero
n <- 10^4
x <- rnorm(n, 10^4, sd=10)
y <- rnorm(n, 10^4, sd=10^3)
mean(u(x)) == mean(u(y)) # Returns TRUE (they're both 0), but I expect E[u(X)] > E[u(Y)]
## An example of replacing u with a*u + b
get.scaled.u <- function(sigma=1) {
stopifnot(sigma > 0) # Risk averse
utility <- function(x) {
return(-exp(-sigma * x + sigma * 10^4))
}
return(utility)
}
u <- get.scaled.u(sigma=1)
mean(u(x)) > mean(u(y)) # TRUE, as desired
x <- rnorm(n, 10^4, sd=10^3)
y <- rnorm(n, 10^4, sd=2*10^3)
mean(u(x)) > mean(u(y)) # FALSE again -- they're both -Inf
Is finding a clever way to scale u the correct way to deal with this problem? For example, suppose X and Y both have bounded support -- if I know the bounds, how can I scale u to guarantee that a*u + b will be neither too close to -Inf, nor too close to zero?
Edit: I didn't know about multiple precision packages. Rmpfr is helpful:
library(Rmpfr)
x.precise <- mpfr(x, 100)
y.precise <- mpfr(y, 100)
mean(u(x.precise)) > mean(u(y.precise)) # TRUE
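Building on the affine rescaling allowed in the question (u can be replaced by a*u(x) + b with a > 0), another option (an assumption of mine, not from the original post) is to pick the shift from the data itself, for example the smallest simulated value, so that the dominant terms of each average stay of order one instead of underflowing to zero or overflowing to -Inf:
## hypothetical helper: equals exp(sigma * shift) * u(x), i.e. an allowed a * u(x) rescaling
u.shifted <- function(x, sigma = 1, shift = 0) -exp(-sigma * (x - shift))
shift <- min(c(x, y))   # or a known lower bound of the support, if available
mean(u.shifted(x, shift = shift)) > mean(u.shifted(y, shift = shift))
Since E[-exp(-sigma * X)] is dominated by the smallest values of X, anchoring the shift there keeps exactly those terms representable.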

Explain the quantile() function in R

I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular I'm confused by the notation of p being a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m, that's straightforward. m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0; for 3, m is -1/2; and for the others, see the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). The algorithms differ only in their choice of the pk.
As for the last bit, R is just stating what S uses.
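To make the plotting-position idea concrete, here is a small check of my own (made-up data, using the type 7 positions p(k) = (k - 1)/(n - 1)) that interpolating the points (p(k), x[k]) reproduces R's default quantile:
obs <- c(3, 1, 4, 1, 5, 9, 2, 6)                  # made-up data
n <- length(obs); p <- 0.3
pk <- (seq_len(n) - 1) / (n - 1)                  # type 7 plotting positions p(k)
approx(pk, sort(obs), xout = p)$y                 # linear interpolation of (p(k), x[k]) at p
quantile(obs, probs = p, type = 7, names = FALSE) # R's default
# both give 2.1 for this data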
The original paper gives a list of 6 desirable properties for a sample quantile function, and states a preference for #8, which satisfies all but one of them. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles and is what's described in Wikipedia.
BTW, in response to dreeves' answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's approach is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's MathWorld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", etc. Each of these methods has different properties that make it better suited to one problem or another.
The alphas and betas are just ways to parameterize the functions p(k). In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation; they just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
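For instance, following that suggestion (a quick sketch):
sapply(1:9, function(t) quantile(1:5, probs = 0.25, type = t, names = FALSE))
sapply(1:9, function(t) quantile(1:6, probs = 0.25, type = t, names = FALSE))
# the nine types give noticeably different answers for such short vectors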
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R, so I just grabbed it from a GitHub repo that looked recent enough to run. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, in the code below; I also added a few lines to print some important values as it goes.
quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
                             type = 7, ...){
if(is.factor(x)) { #worry about non-numeric data
if(!is.ordered(x) || ! type %in% c(1L, 3L))
stop("factors are not allowed")
lx <- levels(x)
} else lx <- NULL
if (na.rm){
x <- x[!is.na(x)]
} else if (anyNA(x)){
stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
}
eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
stop("'probs' outside [0,1]")
#####################################
# here is where terms really used in default type==7 situation get defined
n <- length(x) #how many observations are in sample?
if(na.p <- any(!p.ok)) { # set aside NA & NaN
o.pr <- probs
probs <- probs[p.ok]
probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
}
np <- length(probs) #how many quantiles are you computing?
if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
if(type == 7) { # be completely back-compatible
index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
lo <- floor(index) #this is the observed order statistic just below each quantile
hi <- ceiling(index) #above
x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort,
#and it only guarantees that sorting is "right" at these order statistics, important for large vectors
#ties are not broken and tied elements just stay in their original order
qs <- x[lo] #the values associated with the "floor" order statistics
i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value
#this is the difference between the order statistic and the available ranks, i think
h <- (index - lo)[i] # > 0 by construction
## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
# h is the arithmetic difference between the desired order statistic and the available ranks
#interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd.
# This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
###################################
# print all of these things
cat("floor pos=", c(lo))
cat("\nceiling pos=", c(hi))
cat("\nfloor values= ", c(x[lo]))
cat( "\nwhich floors not targets? ", c(i))
cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
cat( "\nadjustment values= ", c(h))
cat("\nquantile estimates:")
}else if (type <= 3){## Types 1, 2 and 3 are discontinuous sample qs.
nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
} else {n * probs} # m = 0
j <- floor(nppm)
h <- switch(type,
(nppm > j), # type 1
((nppm > j) + 1)/2, # type 2
(nppm != j) | ((j %% 2L) == 1L)) # type 3
} else{
## Types 4 through 9 are continuous sample qs.
switch(type - 3,
{a <- 0; b <- 1}, # type 4
a <- b <- 0.5, # type 5
a <- b <- 0, # type 6
a <- b <- 1, # type 7 (unused here)
a <- b <- 1 / 3, # type 8
a <- b <- 3 / 8) # type 9
## need to watch for rounding errors here
fuzz <- 4 * .Machine$double.eps
nppm <- a + probs * (n + 1 - a - b) # n*probs + m
j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
h <- nppm - j
if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
x <- sort(x, partial =
unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
)
x <- c(x[1L], x[1L], x, x[n], x[n])
## h can be zero or one (types 1 to 3), and infinities matter
#### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
## also h*x might be invalid ... e.g. Dates and ordered factors
qs <- x[j+2L]
qs[h == 1] <- x[j+3L][h == 1]
other <- (0 < h) & (h < 1)
if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
}
} else {
qs <- rep(NA_real_, np)}
if(is.character(lx)){
qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
if(names && np > 0L) {
names(qs) <- format_perc(probs)
}
if(na.p) { # do this more elegantly (?!)
o.pr[p.ok] <- qs
names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
names(o.pr)[p.ok] <- names(qs)
o.pr
} else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#.75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))
