How to automatically compute the Greatest Common RATIONAL Divisor in R

I'm trying to write a function in R to compute the Greatest Common RATIONAL Divisor of a vector. So I'm not working with a vector of integers, but of numerics, and from this vector I would like to automatically determine the highest numeric that divides all the values in the vector and gives an integer result. This is quite difficult with the floating-point arithmetic used in R.
To give an example, let's say that I would like to find the highest common rational divisor of 5, 0.37 and 0.02. It's 0.01, but how can I automate this, taking into account that with floating-point arithmetic 0.37 will, for instance, be stored as something like 0.3700000000000000000000000005271 in R? Because of that I can't easily compute the lowest order of decimal (10^-2 in my example), or, if you prefer, the lowest one that has significance for me as a user.
The fact that the result itself will have floating-point-like error (e.g. 0.0100000000000000000000008465 in place of 0.01) is not a problem. However, ideally the solution should be as general as possible, capable of working with vectors whose values differ by many orders of magnitude (10^20 and 10^-20 for instance).

I got a solution. The basic idea is to divide everything by the smallest value and to multiply by successive integers until every value becomes an integer. This is controlled via the floor() function, which gives explicit control over the level of tolerance. I added some control over the number of possibilities it tests, to keep it reasonably efficient, but I'm not sure this is the best method. Anyway, I'll put this in the StratigrapheR package.
divisor <- function(x, tolerance = 8, relative = TRUE, tries = 4, speak = TRUE)
{
  if(!(isFALSE(relative) | isTRUE(relative))) {
    stop("The 'relative' parameter should be TRUE or FALSE")
  }

  if(!(isFALSE(speak) | isTRUE(speak))) {
    stop("The 'speak' parameter should be TRUE or FALSE")
  }

  x <- unique(x)
  x <- x[x != 0]

  # Divide by the smallest value
  mx <- min(x)
  d  <- x/mx

  if(!relative){
    # Test whether the tolerance is of lower order than the smallest value
    if(-log10(mx) > tolerance) {
      stop(paste("If 'relative' is FALSE, the smallest value (zero excepted)",
                 "should be of higher order than the order",
                 "defined by the 'tolerance' parameter"))
    }
  }

  # Test the dispersion of values
  general_tolerance <- 15 # Order of digits affected by floating-point

  if(log10(max(x)) >= (general_tolerance - tolerance)){
    stop(paste("The range of 'x' values is too large to find a meaningful",
               "greatest common rational divisor.",
               "To solve this problem you can change the values in x or",
               "lower the 'tolerance' parameter (i.e. the",
               "tolerance for floating-point arithmetic): in the latter case be",
               "critical of the result."))
  }

  # Test and remove values that are multiples of the smallest value
  remain1 <- (d - floor(d + 10^-(tolerance - 1)))

  if(!relative) remain1 <- remain1 * mx

  rzero1 <- abs(remain1) < 10^-tolerance

  d <- d[!rzero1]

  if(length(d) == 0) {

    if(!relative){
      res <- round(mx, tolerance)
    } else {
      res <- signif(mx, tolerance)
    }

  } else {

    # Multiply d [x/min(x)] by integers, and test whether
    # this returns only integers within tolerance
    ld <- length(d)

    try_order_OLD <- 0
    try_order_i   <- 6 - ceiling(log10(ld))

    for(i in seq_len(as.integer(tries))){

      if(speak) {
        print(paste("Try ", i, ": 10^", try_order_i,
                    " possibilities tested", sep = ""))
      }

      t <- 1:(10^try_order_i)
      t <- t[-(1:(10^try_order_OLD))]

      lt <- length(t)

      tmat <- matrix(rep(t, ld), ncol = ld)
      dmat <- matrix(rep(d, lt), ncol = ld, byrow = TRUE)

      test <- dmat * tmat

      remain2 <- (test - floor(test + 10^-(tolerance - 1)))
      remain2 <- remain2/tmat

      if(!relative) remain2 <- remain2 * mx

      rzero2 <- abs(remain2) < 10^-tolerance
      rzero2 <- matrix(as.integer(rzero2), ncol = ld)

      res <- mx/t[which(rowSums(rzero2) == ld)[1]]

      if(!is.na(res)) break

      try_order_OLD <- try_order_i
      try_order_i   <- try_order_i + 1
    }
  }

  return(res)
}
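As a quick sanity check of the function above (output shown as comments; the trailing digits of the returned value may differ slightly because of floating-point error):
divisor(c(5, 0.37, 0.02))
# [1] "Try 1: 10^6 possibilities tested"
# [1] 0.01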

Related

R. Add conditions within optimization function

I have the following optimization function, similar to the one in R. Run optimization function in data frame:
main <- function(p1, p2, n1, n2, pE) {
  # FIND MINIMUM a
  func <- function(a) {
    Mopt <- (p1 - a*pE)/(1 - a)
    f_n <- (Mopt - p2)^2 - Mopt*(1 - Mopt)/(n1 - 1) - p2*(1 - p2)/(n2 - 1)
    f_d <- Mopt*(1 - p2) + p2*(1 - Mopt)
    f_v <- f_n/f_d
  }
  opt <- optimize(func, seq(0, 1, by = 0.01), maximum = FALSE)$minimum
}
Here, the arguments of "main" are columns from a data frame. The function returns the value of "a" that gives the minimum f_v value. I would like to add some conditions to the function, or in other words, to force certain objects to take values within a certain range while minimizing f_v. For instance, Mopt must satisfy:
0 < Mopt < 1
and (1 - a) must satisfy:
(1 - a) > 0.
I am not sure how to do this in the context of an optimization.
I think simply adding the conditions would work. The code might look like this:
func <- function(a) {
  a <- min(a, 1 - 1e-7)
  Mopt <- max(min((p1 - a*pE)/(1 - a), 1 - 1e-7), 1e-7)
  # this enforces a < 1 and 0 < Mopt < 1; 1e-7 keeps the inequalities strict,
  # and can be 1e-6 or 1e-8 depending on the precision you need
  f_n <- (Mopt - p2)^2 - Mopt*(1 - Mopt)/(n1 - 1) - p2*(1 - p2)/(n2 - 1)
  f_d <- Mopt*(1 - p2) + p2*(1 - Mopt)
  f_v <- f_n/f_d
}
opt <- seq(0, 1, 1e-7)[which.min(sapply(seq(0, 1, 1e-7), func))]
Addition:
The code above would return the right $objective but may fail to find the minimizing a. To search for the minimum, the function should be:
func <- function(a) {
  if ((1 - a) < 1e-7) return(Inf)
  # ensure the optimum is searched within the constrained range
  Mopt <- (p1 - a*pE)/(1 - a)
  if (Mopt < 1e-7 || Mopt > (1 - 1e-7)) return(Inf)
  # ensure the optimum is searched within the constrained range
  f_n <- (Mopt - p2)^2 - Mopt*(1 - Mopt)/(n1 - 1) - p2*(1 - p2)/(n2 - 1)
  f_d <- Mopt*(1 - p2) + p2*(1 - Mopt)
  f_v <- f_n/f_d
}
opt <- seq(0, 1, 1e-7)[which.min(sapply(seq(0, 1, 1e-7), func))]
It is very time-consuming, but workable when you do not need to repeat the computation many times.
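For completeness, here is a self-contained sketch of the same penalized approach; the parameter values for p1, p2, n1, n2 and pE are made up purely for illustration, and a much coarser grid is used so the example runs quickly:
# illustrative parameter values (assumptions, not taken from the question)
p1 <- 0.6; p2 <- 0.4; n1 <- 30; n2 <- 25; pE <- 0.5
func <- function(a) {
  if ((1 - a) < 1e-7) return(Inf)                  # enforce (1 - a) > 0
  Mopt <- (p1 - a*pE)/(1 - a)
  if (Mopt < 1e-7 || Mopt > 1 - 1e-7) return(Inf)  # enforce 0 < Mopt < 1
  f_n <- (Mopt - p2)^2 - Mopt*(1 - Mopt)/(n1 - 1) - p2*(1 - p2)/(n2 - 1)
  f_d <- Mopt*(1 - p2) + p2*(1 - Mopt)
  f_n/f_d
}
grid <- seq(0, 1, by = 1e-4)   # coarser than 1e-7, just for the demo
opt  <- grid[which.min(sapply(grid, func))]
opt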

R - finding roots for a cartesian product of function parameters

Given a function f(x,c,d) of x that also depends on some parameters c and d, I would like to find the zeroes for a Cartesian product of certain values c_1,...,c_n and d_1,...,d_m of the parameters, i.e. an x_ij such that f(x_ij,c_i,d_j)=0 for i=1,...,n and j=1,...,m. Although not that crucial, I am applying a Newton-Raphson algorithm for the root finding:
newton.raphson <- function(f, a, b, tol = 1e-5, n = 1000){
  require(numDeriv) # Package for computing f'(x)

  x0 <- a # Set start value to supplied lower bound
  k  <- n # Initialize for iteration results

  # Check the upper and lower bounds to see if approximations result in 0
  fa <- f(a)
  if (fa == 0.0){
    return(a)
  }

  fb <- f(b)
  if (fb == 0.0) {
    return(b)
  }

  for (i in 1:n) {
    dx <- genD(func = f, x = x0)$D[1] # First-order derivative f'(x0)
    x1 <- x0 - (f(x0) / dx) # Calculate next value x1
    k[i] <- x1 # Store x1
    # Once the difference between x0 and x1 becomes sufficiently small, output the results.
    if (abs(x1 - x0) < tol) {
      root.approx <- tail(k, n = 1)
      res <- list('root approximation' = root.approx, 'iterations' = k)
      return(res)
    }
    # If Newton-Raphson has not yet reached convergence set x1 as x0 and continue
    x0 <- x1
  }
  print('Too many iterations in method')
}
The actual function that I am interested in is more complicated, but the following example illustrates my problem.
test.function <- function(x = 1, c = 1, d = 1){
  return(c*d - x)
}
Then for any given c_i and d_j I can easily calculate the zero by
newton.raphson(function(x) test.function(x,c=c_i,d=d_j),0,1)[1]
which here is obviously just the product c_i*d_j.
Now I tried to define a function that, for two given vectors (c_1,...,c_n) and (d_1,...,d_m), finds the zeroes for all combinations. For this, I defined
zeroes <- function(ci = 1, dj = 1){
  x <- newton.raphson(function(x) test.function(x, c = ci, d = dj), 0, 1)[1]
  return(as.numeric(x))
}
and then use the outer-function, e.g.
outer(c(1,2),c(1,2,3),FUN=zeroes)
Unfortunately, this did not work. I got an error message
Error during wrapup: dims [product 6] do not match the length of object [1]
There might also be a much better solution to my problem; I am happy for any input.
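A minimal sketch of what is going on here, for what it's worth: outer() calls FUN once with the two fully recycled argument vectors and expects a result of the same length back, whereas zeroes() as written returns a single number. One common workaround is to vectorize the wrapper before handing it to outer():
# outer() needs FUN to work element-wise on whole vectors;
# Vectorize() makes zeroes() be applied to each (ci, dj) pair separately
outer(c(1, 2), c(1, 2, 3), FUN = Vectorize(zeroes))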

Interpreting [R] Greatest Common Divisor (GCD) (and LCM) Function in {numbers} package

I don't have a background in programming (apart from wrestling with R to get things done), and I'm trying to verbalize what the formula for the greatest common divisor in the R {numbers} package is doing at each step. I need help with understanding the flow of steps within the function:
function (n, m)
{
    stopifnot(is.numeric(n), is.numeric(m))
    if (length(n) != 1 || floor(n) != ceiling(n) || length(m) != 1 || floor(m) != ceiling(m))
        stop("Arguments 'n', 'm' must be integer scalars.")
    if (n == 0 && m == 0)
        return(0)
    n <- abs(n)
    m <- abs(m)
    if (m > n) {
        t <- n
        n <- m
        m <- t
    }
    while (m > 0) {
        t <- n
        n <- m
        m <- t%%m
    }
    return(n)
}
<environment: namespace:numbers>
For instance, in the if (m > n) {} part the n becomes t and ultimately it becomes m? I'm afraid to ask, because it may be painfully obvious, but I don't know what is going on. The same applies, I guess, to the rest of the function, with %% being perhaps the modulo operator.
What it says is:
Stop if either m or n are not numeric, more than one number, or have decimals, and return the message, "Arguments 'n', 'm' must be integer scalars."
If they both are zero, return zero.
Using absolute values from now on.
Make sure that n > m because of the algorithm we'll end up applying in the next step. If this is not the case, flip them: first place n in a temporary variable "t", then assign m to n, so that the larger number now sits at the beginning of the (n, m) pair. At this point both n and m hold the original m. Finish up by retrieving the value from the temporary variable and assigning it to m.
Now they apply the modified Euclidean algorithm to find the GCD - a more efficient version of the algorithm that shortcuts the multiple subtractions, instead replacing the larger of the two numbers by its remainder when divided by the smaller of the two.
The smaller number at the beginning of the algorithm will end up being the larger after the first iteration, therefore we'll assign it to n to get ready for the second iteration. To do so, though, we need to get the current n out of the way by assigning it to the temporary variable t. After that we get the modulo resulting from dividing the original larger number (n), which now is stored in t, by the smaller number m. The result will replace the number stored in m.
As long as there is a remainder (modulo) the process will go on, this time with the initial smaller number, m playing the role of the big guy. When there is no remainder, the smaller of the numbers in that particular iteration is returned.
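For example, tracing GCD(48, 18) through the while loop by hand (my own illustration):
# start:       n = 48, m = 18
# iteration 1: t = 48, n = 18, m = 48 %% 18 = 12
# iteration 2: t = 18, n = 12, m = 18 %% 12 = 6
# iteration 3: t = 12, n = 6,  m = 12 %% 6  = 0
# m is now 0, the loop stops, and n = 6 is returned
numbers::GCD(48, 18)
# [1] 6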
ADDENDUM:
Now that I know how to read this function, I see that it is limited to two numbers as input. So I entertained myself putting together a function that can work with three integers:
require(numbers)
GCF <- function(x, y, z){
  tab.x <- tabulate(primeFactors(x))
  tab.y <- tabulate(primeFactors(y))
  tab.z <- tabulate(primeFactors(z))
  max.len <- max(length(tab.x), length(tab.y), length(tab.z))
  tab_x <- c(tab.x, rep(0, max.len - length(tab.x)))
  tab_y <- c(tab.y, rep(0, max.len - length(tab.y)))
  tab_z <- c(tab.z, rep(0, max.len - length(tab.z)))
  GCD_elem <- numeric()
  for(i in 1:max.len){
    # each prime i contributes i raised to its smallest exponent across x, y and z
    GCD_elem[i] <- i^min(tab_x[i], tab_y[i], tab_z[i])
  }
  GCD_elem <- GCD_elem[GCD_elem != 1]
  GrCD <- prod(GCD_elem)
  print(GrCD)
}
Also for the LCM:
LCM <- function(x, y, z){
  tab.x <- tabulate(primeFactors(x))
  tab.y <- tabulate(primeFactors(y))
  tab.z <- tabulate(primeFactors(z))
  max.len <- max(length(tab.x), length(tab.y), length(tab.z))
  tab_x <- c(tab.x, rep(0, max.len - length(tab.x)))
  tab_y <- c(tab.y, rep(0, max.len - length(tab.y)))
  tab_z <- c(tab.z, rep(0, max.len - length(tab.z)))
  LCM_elem <- numeric()
  for(i in 1:max.len){
    LCM_elem[i] <- i^(max(tab_x[i], tab_y[i], tab_z[i]))
  }
  LCM <- prod(LCM_elem)
  print(LCM)
}
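A quick check of both helpers (output shown as comments); note that, if I remember correctly, the numbers package also ships mGCD() and mLCM() for whole integer vectors, which may be all you need in practice:
GCF(12, 18, 24)
# [1] 6
GCF(9, 27, 18)
# [1] 9
LCM(4, 6, 10)
# [1] 60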

Recursively Inverting a linear system - getting huge errors (precision)

I have an Ax = b type linear system, where A is an upper-triangular matrix. The structure of A is defined as follows:
comp.Amat <- function(i,j,prob) ifelse(i > j, 0, dbinom(x=i, size=j, prob=prob))
prob <- 1/4
A <- outer(1:50, 1:50 , FUN=function(r,c) comp.Amat(r,c,prob) )
The entries in A are binomial probabilities, and the issue is that the diagonal entries quickly approach 0 as the size of A grows.
If we define the vector b as follows as well:
b <- seq(1,50,1);
Then solve(a=A,b=b) - gives an error:
" system is computationally singular: reciprocal condition number = 1.07584e-64"
That makes sense, since the diagonal entries are almost 0, so the matrix becomes numerically non-invertible.
As a work-around, I have written the following function, which starts by computing the value corresponding to the last diagonal entry, then substitutes that value into the previous rows. Since each entry in the matrix is dbinom(x=i, size=j, prob) for j >= i, I can get a solution this way.
solve.for.x.custom <- function(A, b, prob)
{
  n <- length(A[1,])
  m <- length(A[,1])

  x <- seq(1, n, 1)
  x[x > 0] <- -1000

  calc.inv.Aii <- function(i, j, prob)
  {
    res <- (1 / (prob*(1-prob)))^i
    return(res)
  }

  for (i in m:1)
  {
    if(i == m)
    {
      rhs <- 0
    } else
    {
      rhs <- 0
      for(j in m:(i+1))
      {
        rhs <- dbinom(x=i, size=j, prob=prob)*x[j] + rhs
      }
    }
    x[i] <- (b[i] - rhs)*calc.inv.Aii(i, i, prob)
  }

  print(x)
  return(x)
}
My problem is that when I multiply this solution x' by the matrix A, the errors (Ax' - b) are huge. Since I have an analytical solution (each entry x_i can be described in terms of binomial probabilities multiplied by the previous values), the error I should get is 0 in each row.
I see that (1 / (1/a)) may not be exactly equal to a because of floating-point issues. However, the current errors are really big (e.g. -1.13817489781529e+168).
x_prime=solve.for.x.custom(A, b, prob)
A%*%x_prime - b
#output
[,1]
[1,] -1.13817489781529e+168
[2,] 2.11872209742428e+167
[3,] -1.58403954589004e+166
[4,] 6.52328959209082e+164
[5,] -1.69562573261261e+163
[6,] 3.00614551450976e+161
***
[49,] -7.58010305220250e+08
[50,] 9.65162608741321e+03
I would really appreciate any suggestions or more efficient methods. I gave the size of A and b as 50, but I intend to grow them as well, and in that case the error will increase too.
If your matrix A is upper triangular you probably want to use backsolve(A, b) rather than solve(A, b).
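For instance, with the A and b defined in the question (just a sketch; in double precision the residual can still be sizeable because of how ill-conditioned A is):
x <- backsolve(A, b)    # exploits the upper-triangular structure, no explicit inversion
max(abs(A %*% x - b))   # residual of the triangular solve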
You can do arbitrary precision in R with Rmpfr, which will require writing a compatible version of backsolve. With the code below we can get
> print(max(abs(b - .b)), digits=5)
1 'mpfr' number of precision 1024 bits
[1] 2.9686e-267
There is one important caveat though: the values in A may not be accurate enough, since they come from dbinom rather than using mpfr objects. Depending on your end goal, you may need to write your own version of dbinom using Rmpfr.
library(Rmpfr)
logcomp.Amat <- function(i,j,prob) ifelse(i > j, -Inf, dbinom(x=i, size=j, prob=prob, log=TRUE))
nbits <- 1024
.backsolve <- function(A, b) {
  n <- length(b)
  x <- mpfr(numeric(n), nbits)
  for(i in rev(seq_len(n))) {
    known <- i + seq_len(n - i)
    z <- if(length(known) > 0) sum(A[i,known] * x[known]) else 0
    x[i] <- (b[i] - z) / A[i,i]
  }
  return(x)
}
logA <- outer(1:50, 1:50, logcomp.Amat, prob=1/4)
b <- 1:50
A <- exp(mpfr(logA, nbits))
b <- mpfr(b, nbits)
x <- .backsolve(A, b)
.b <- as.vector(A %*% x)

Explain the quantile() function in R

I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, by what the heck m is, and by how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m that's straightforward. m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0, for 3, m is -1/2, and for the others, that's in the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and the m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). Each algorithm only differs in the algorithm for the pk.
As for the last bit, R is just stating what S uses.
The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles and is what's described in wikipedia.
BTW, in response to dreeves' answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's Mathworld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
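To make the notation concrete, here is a small hand-worked example for the default type 7 (my own illustration, using the formulas above): alpha = beta = 1, so m = 1 - p, j = int(np + m), and gamma is the fractional part g.
x <- c(10, 20, 30, 40)   # n = 4, already sorted
p <- 0.4
n <- length(x)
m <- 1 - p               # type 7: m = alpha + p*(1 - alpha - beta) = 1 - p
j <- floor(n*p + m)      # j = 2
gamma <- n*p + m - j     # fractional part g = 0.2
(1 - gamma) * x[j] + gamma * x[j + 1]            # 22
quantile(x, probs = p, type = 7, names = FALSE)  # also 22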
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", etc. Each of these methods has different properties that are better suited to one problem or another.
The alpha's and beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
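For instance, following that suggestion, you can line up the 0.25 quantile across all nine types and see how the choice of p(k) changes the interpolation:
sapply(1:9, function(type) quantile(1:5, probs = 0.25, type = type, names = FALSE))
sapply(1:9, function(type) quantile(1:6, probs = 0.25, type = type, names = FALSE))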
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R so I actually just grabbed it from a github repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code and also I added a few lines to print some important values as it goes.
quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
                             names = TRUE, type = 7, ...){
  if(is.factor(x)) { #worry about non-numeric data
    if(!is.ordered(x) || ! type %in% c(1L, 3L))
      stop("factors are not allowed")
    lx <- levels(x)
  } else lx <- NULL
  if (na.rm){
    x <- x[!is.na(x)]
  } else if (anyNA(x)){
    stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
  }
  eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
  if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
    stop("'probs' outside [0,1]")
  #####################################
  # here is where terms really used in the default type==7 situation get defined
  n <- length(x) #how many observations are in sample?
  if(na.p <- any(!p.ok)) { # set aside NA & NaN
    o.pr <- probs
    probs <- probs[p.ok]
    probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
  }
  np <- length(probs) #how many quantiles are you computing?
  if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
    if(type == 7) { # be completely back-compatible
      index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
      lo <- floor(index) #this is the observed order statistic just below each quantile
      hi <- ceiling(index) #above
      x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort,
      #and it only guarantees that sorting is "right" at these order statistics, important for large vectors
      #ties are not broken and tied elements just stay in their original order
      qs <- x[lo] #the values associated with the "floor" order statistics
      i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value
      #this is the difference between the order statistic and the available ranks, i think
      h <- (index - lo)[i] # > 0 by construction
      ## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
      ## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
      qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
      # h is the arithmetic difference between the desired order statistic and the available ranks
      #interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd.
      # This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
      ###################################
      # print all of these things
      cat("floor pos=", c(lo))
      cat("\nceiling pos=", c(hi))
      cat("\nfloor values= ", c(x[lo]))
      cat("\nwhich floors not targets? ", c(i))
      cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
      cat("\nadjustment values= ", c(h))
      cat("\nquantile estimates:")
    } else if (type <= 3){ ## Types 1, 2 and 3 are discontinuous sample qs.
      nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
      } else {n * probs} # m = 0
      j <- floor(nppm)
      h <- switch(type,
                  (nppm > j), # type 1
                  ((nppm > j) + 1)/2, # type 2
                  (nppm != j) | ((j %% 2L) == 1L)) # type 3
    } else {
      ## Types 4 through 9 are continuous sample qs.
      switch(type - 3,
             {a <- 0; b <- 1}, # type 4
             a <- b <- 0.5, # type 5
             a <- b <- 0, # type 6
             a <- b <- 1, # type 7 (unused here)
             a <- b <- 1 / 3, # type 8
             a <- b <- 3 / 8) # type 9
      ## need to watch for rounding errors here
      fuzz <- 4 * .Machine$double.eps
      nppm <- a + probs * (n + 1 - a - b) # n*probs + m
      j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
      h <- nppm - j
      if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
      x <- sort(x, partial =
                unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n)))
      x <- c(x[1L], x[1L], x, x[n], x[n])
      ## h can be zero or one (types 1 to 3), and infinities matter
      #### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
      ## also h*x might be invalid ... e.g. Dates and ordered factors
      qs <- x[j+2L]
      qs[h == 1] <- x[j+3L][h == 1]
      other <- (0 < h) & (h < 1)
      if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
    }
  } else {
    qs <- rep(NA_real_, np)
  }
  if(is.character(lx)){
    qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)
  }
  if(names && np > 0L) {
    names(qs) <- format_perc(probs) # note: format_perc is internal to stats (stats:::format_perc)
  }
  if(na.p) { # do this more elegantly (?!)
    o.pr[p.ok] <- qs
    names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
    names(o.pr)[p.ok] <- names(qs)
    o.pr
  } else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#.75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))
