Constructing a Brownian bridge from 0 to T in R

I'm trying to simulate a Brownian bridge from Wiener process, but struggling with code.
Here is what I'm trying to do, in math form:
B(t) = W(t) - t*W(1)
It is important that the process is pinned at the origin at both t = 0 and t = T, i.e., it should start and end with B(0) = B(T) = 0.
Here I'm defining Wiener process:
n <- 1000
T <- 1
delta <- T/n
t <- seq(0,T, delta)
set.seed(17)
W1 <- cumsum(c(0,rnorm(n, 0, 1) * sqrt(delta)))
plot(t, W1 ,type = 'l',
xlab = 't', ylab = 'W(t)', xlim = c(0, T + delta))
Here I'm trying to build the Brownian bridge, but I get the wrong result:
B <- W1 - t * W1[T] # The Brownian bridge from (0,0) to (1,target)
plot(t, B, type="l")
I expect B[1] == B[1001] to be [TRUE] and equal to 0, but in my case B[1001] is not equal to 0.
Any suggestions?

There's just one issue - in the definition of B. Instead you want
B <- W1 - t * W1[n + 1]
since you need to index W1 as a vector rather than as a process: n + 1 picks the last element, which corresponds to t = 1, whereas W1[T] is just W1[1] = 0 because T is 1. Then indeed
B[1]
# [1] 0
B[1001]
# [1] 0
as expected. Note also though that using t and T as variable names is not advisable as both of them already have important roles; see ?t and ?T.
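Putting it together, here is a minimal self-contained sketch of the corrected simulation (with t and T renamed to tt and Tmax, following the advice above):
n <- 1000
Tmax <- 1                       # avoid T, the usual shorthand for TRUE
delta <- Tmax/n
tt <- seq(0, Tmax, by = delta)  # avoid t, which masks base::t
set.seed(17)
W1 <- cumsum(c(0, rnorm(n, 0, 1) * sqrt(delta)))
B <- W1 - tt * W1[n + 1]        # subtract t * W(T), W(T) being the last element of the vector
B[1]                            # 0
B[n + 1]                        # 0
plot(tt, B, type = "l", xlab = "t", ylab = "B(t)")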

Solve ODE that depends on x in R?

Consider the following ODE with boundary condition
y'[x] = (a - 1)*y[x]/x
y[1] = a*b
where a > 0, b > 0, and x > 0.
I want to solve this ODE numerically in R using the command ode from the R package deSolve.
That is, I want to calculate the solution y[x] for x > 0.
Questions: (i) I do not know how to include "x" in the denominator in the model equation. (ii) I also don't know how to specify the initial condition.
Disclaimer: I know this equation has an analytical solution, but I am trying to compare it with the numerical solution.
So far, I have unsuccessfully tried:
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y*(a - 1)/x # HOW TO SPECIFY x?
# result
return( list(dy) )
}
params <- c(a = 2, b = 1)
init <- c(y = params[1]*params[2]) # HOW TO SPECIFY THE INITIAL CONDITION?
times <- seq(0,5, by = 0.1)
out <- ode(init, times, difeq, params)
plot(out)
Looking at the equation, I would say that x is equivalent to time t. Package deSolve calls the independent variable time because time-dependent systems are quite common, but it is not limited to them. The independent variable can also be something else, e.g. a spatial coordinate: just set x = t and it works exactly the same.
Note also that you have to avoid 0 in the denominator.
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y * (a - 1) / x
# result
return(list(y=dy))
}
params <- c(a = 2, b = 1)
init <- c(y = unname(params[1]*params[2]))
times <- seq(1, 5, by = 0.1)
out <- ode(init, times, difeq, params)
## just rename "time" to "x"
colnames(out)[1] <- "x"
head(out)
    x   y
1 1.0 2.0
2 1.1 2.2
3 1.2 2.4
4 1.3 2.6
5 1.4 2.8
6 1.5 3.0
7 1.6 3.2
8 1.7 3.4
To get the initial value at x -> 0, one may use an optimizer, run the system backwards (see the separate answer below), or use another solver (see the CRAN Task View).
Another part of the question was how to handle the initial condition, which is specified at x = 1 rather than at the left end of the x range. As this is somewhat separate from the technical question of how to include x in the denominator, I give it as a separate answer.
The approach uses a backward simulation. As negative time steps are not supported in deSolve, a workaround can be used:
Instead of running the simulation from 1 to 0 backwards, we run it forward from -1 to 0.
Because of the pole at x = 0, we stop at a small value close to zero.
In the model, we change the sign of the derivative, and as x also appears in the model, we multiply it by the sign too.
Note: this is explicitly included here to show the general approach for didactical reasons. In the special case here, signs will of course cancel out.
In the script below, we first start integration from the known initial value at x=1. Then we do the "backward" integration to estimate the initial value close to zero. Finally we run the model with the new initial value for the whole range of x and compare the results.
library(deSolve)
difeq <- function(x, y, parms, sign = 1) {
with(as.list(parms),{
dy <- sign * y * (a - 1) / (sign * x)
return(list(y=dy))
})
}
## initial simulation starting at x = 1
params <- c(a = 2, b = 1) # do not put sign into parms, or it would mask the sign argument inside with()
init <- c(y = params[["a"]] * params[["b"]])
times <- seq(1, 5, by = 0.1)
out1 <- ode(init, times, difeq, params)
## backwards simulation
close_to_zero <- 1e-16 # numerical precision is limited, do not decrease this more
times <- c(-1, -close_to_zero)
out0 <- ode(init, times, difeq, params, sign = -1)
## forward simulation starting close to zero
times <- c(close_to_zero, seq(0.1, 10, 0.1))
init <- c(y = out0[[nrow(out0), 2]])
cat("y[", close_to_zero,"] =", init, "\n")
out2 <- ode(init, times, difeq, params)#, hmax=0.01)
## comparison of the initial simulation with an extended time period
plot(out2, out1, lwd=c(1, 5), lty=c("solid", "dotted"), xlab="x")
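Since the question mentioned comparing with the analytical solution, here is a minimal sketch of that check. It assumes the closed form y(x) = a*b*x^(a-1), which satisfies y' = (a - 1)*y/x and y(1) = a*b, and reuses out1 from above:
a <- params[["a"]]; b <- params[["b"]]
x_num <- out1[, "time"]            # deSolve labels the independent variable "time"
y_ana <- a * b * x_num^(a - 1)     # analytic solution
max(abs(out1[, "y"] - y_ana))      # should be tiny, limited only by the solver tolerances
plot(x_num, out1[, "y"], type = "l", xlab = "x", ylab = "y")
lines(x_num, y_ana, lty = "dotted", lwd = 3)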

R: Simulating line segment conditional

For a perceptual task, I wish to simulate multiple items, each consisting of a plotted single line with two 'breaking points' where the line abruptly changes direction. So in essence the line consists of three connected line segments (AB, BC, and CD), connecting four coordinates (Axy, Bxy, Cxy, Dxy), each segment with a different slope.
The line must agree with the following three conditions:
1) The total length of the line (L), which is the sum of the length of the three line segments (AB, BC, and CD) should vary between items, but always fall within the range of l1 and l2.
2) The line should fit within and take up an X*Y sized rectangle. That is, at least one x-coordinate (Ax, Bx, Cx, or Dx) should equal 0, at least one x-coordinate (Ax, Bx, Cx, or Dx) should equal X, at least one y-coordinate (Ay, By, Cy, or Dy) should be 0, at least one y-coordinate (Ay, By, Cy, or Dy) should equal Y; none of the x-coordinates should be lower than 0 or higher than X, none of the y-coordinates should be lower than 0 or higher than Y.
3) The line segments may not cross. That is, line segment AB and CD may not cross (as line BC is connected at one end to both other line segments, it cannot cross them).
I wish to do this in R. So far I've only managed code in which a random line is created and then checked against all three conditions; if it fails, the code starts anew. This method takes way too long!
Does anyone have an idea how I could make this code more efficient? Current R-code provided below.
#START WHILE LOOP
STOP = FALSE
CONDITION_COUNTER <- c(0,0,0)
while(STOP==FALSE){ #start condition checking loop
#SETTINGS:
l1 = 8 #minimum length L
l2 = 12 #maximum length L
L = runif(1,l1,l2) #length L
X = 5 #width square for length L
Y = 7 #height square for length L
#CREATE LINE SEGMENT:
Ax <- runif(1,0,X) #x-coordinate point A
Ay <- runif(1,0,Y) #y-coordinate point A
Bx <- runif(1,0,X) #x-coordinate point B
By <- runif(1,0,Y) #y-coordinate point B
Cx <- runif(1,0,X) #x-coordinate point C
Cy <- runif(1,0,Y) #y-coordinate point C
Dx <- runif(1,0,X) #x-coordinate point D
Dy <- runif(1,0,Y) #y-coordinate point D
#CHECK CONDITION 01 (line has to equal length L)
AB = sqrt((Ax-Bx)^2 + (Ay-By)^2) #length line segment AB
BC = sqrt((Bx-Cx)^2 + (By-Cy)^2) #length line segment BC
CD = sqrt((Cx-Dx)^2 + (Cy-Dy)^2) #length line segment CD
CONDITION_COUNTER[1] <- L == AB + BC + CD #Condition 1 satisfied (1) or not (0)?
#CHECK CONDITION 02 (line has to fill the square)
c1 = sum(c(Ax, Bx, Cx, Dx) == 0) > 0 #does one point have x-coordinate 0?
c2 = sum(c(Ax, Bx, Cx, Dx) == X) > 0 #does one point have x-coordinate X?
c3 = sum(c(Ay, By, Cy, Dy) == 0) > 0 #does one point have y-coordinate 0?
c4 = sum(c(Ay, By, Cy, Dy) == Y) > 0 #does one point have y-coordinate Y?
CONDITION_COUNTER[2] <- sum(c(c1,c2,c3,c4)) == 4 #Condition 2 satisfied (1) or not (0)?
#CHECK CONDITION 03 (line segments may not cross)
a <- max(c(Ax,Bx)); b <- min(c(Ax,Bx)); x <- a-b; x
a <- c(Ay,By)[which.max(c(Ax,Bx))]; b <- c(Ay,By)[which.min(c(Ax,Bx))]; y <- a-b; y
slopeAB <- y/x
InterceptAB <- Ay - slopeAB * Ax
c <- max(c(Cx,Dx)); d <- min(c(Cx,Dx)); x <- c-d; x
c <- c(Cy,Dy)[which.max(c(Cx,Dx))]; d <- c(Cy,Dy)[which.min(c(Cx,Dx))]; y <- c-d; y
slopeCD <- y/x
InterceptCD <- Cy - slopeCD * Cx
intersection <- (InterceptAB - InterceptCD)/(slopeCD - slopeAB) #what is the hypothetical x-coordinate of intersection?
c1 <- min(c(Ax,Bx)) <= intersection & intersection <= max(c(Ax,Bx)) #does AB contain that x-coordinate? (TRUE=yes, FALSE=no)
c1 <- (c1 -1)*-1
CONDITION_COUNTER[3] <- c1
CHECK <- (sum(CONDITION_COUNTER) == 3) #check if all conditions are met
if(CHECK == TRUE){STOP <- TRUE} #if all conditions are met, stop loop
} #END WHILE LOOP
#Plot:
plot(-1:10, -1:10, xaxt='n',yaxt='n',bty='n',pch='',ylab='',xlab='', col="white")
segments(Ax,Ay,Bx,By, lwd=2) #segment AB
segments(Bx,By,Cx,Cy, lwd=2) #segment BC
segments(Cx,Cy,Dx,Dy, lwd=2) #segment CD
#Add square that it has to fill
segments(0,0,X,0, col="red")
segments(0,0,0,Y, col="red")
segments(X,0,X,Y, col="red")
segments(0,Y,X,Y, col="red")
Since your constraints force the picture to look like your image (or perhaps a rotated copy) you can think of the problem as one of picking 4 numbers (a location on each edge) rather than 8. Intersections would be impossible, so no need to check. Pick the first three points, and then pause to check if it is
possible to extend it to the fourth (given the length constraints). As a safety valve, put a bound on the number of attempts to find a feasible solution:
dis <- function(x0,y0,x1,y1){
sqrt(sum((c(x1,y1)-c(x0,y0))^2))
}
broken.line <- function(X,Y,l1,l2,attempts = 1000){
Ax <- 0
By <- 0
Cx <- X
Dy <- Y
for(i in 1:attempts){
Ay <- runif(1,0,Y)
Bx <- runif(1,0,X)
Cy <- runif(1,0,Y)
L <- dis(Ax,Ay,Bx,By) + dis(Bx,By,Cx,Cy)
d.min <- Y - Cy #min dist to top edge
if(l1 < L + d.min && L + d.min < l2){
#it is feasible to complete this
#configuration -- calculate how much
#of top edge is a valid choice
#d.max is farthest that last point
#can be from the upper right corner:
d.max <- sqrt((l2 - L)^2 - d.min^2)
Dx <- runif(1,max(0,X-d.max),X)
points <- c(Ax,Bx,Cx,Dx,Ay,By,Cy,Dy)
return(matrix(points,ncol = 2))
}
}
NULL #can't find a feasible solution
}
It is fairly quick. With your parameters it can generate tens of thousands of solutions per second. For a quick test:
> m <- broken.line(5,7,8,12)
> m
[,1] [,2]
[1,] 0.000000 1.613904
[2,] 1.008444 0.000000
[3,] 5.000000 3.627471
[4,] 3.145380 7.000000
> plot(m,type = 'l')
Graph:

Recursive calculation of a simulated variable

I'm trying to simulate a variable and it's supposed to work like this:
v[t] = Q * v[t-1] + e[t]
e is a random error I generate using rnorm(156,0,0.001); v is what I aim to simulate; Q is a coefficient (I'm using 0.5).
The 1st value v[1] would be equal to e[1]. Then
v[2] = Q * v[1] + e[2]
v[3] = Q * v[2] + e[3]
. . .
I'm new to R and trying to use a for loop, but I'm struggling (I was going to post my code here, but it isn't working, so I thought I wouldn't waste people's time). Thanks in advance!
This is a typical autoregressive process, which can be generated with filter() using the "recursive" method.
e <- rnorm(156, 0, 0.001)
filter(x = c(0, e), filter = 0.5, method = "recursive")[-1]
Let's consider a small example with length 5 only:
set.seed(0)
e <- rnorm(5, 0, 0.1)
# [1] 0.12629543 -0.03262334 0.13297993 0.12724293 0.04146414
x <- filter(x = c(0, e), filter = 0.5, method = "recursive")
x[-1]
# [1] 0.12629543 0.03052438 0.14824212 0.20136399 0.14214614
filter is the workhorse of arima.sim; however, it is simply a computational routine written in C and does not require the process to be stationary. Readers interested in arima.sim may continue with:
Simulate a time series
Simulate an AR(1) process with uniform innovations
We note that the unit impulse response of the autoregressive process v(t) = Q*v(t-1) + u(t) is:
unit_res <- c(1, Q, Q^2, Q^3, ...)
We can generate this response using unit_res <- q^(seq_len(length(err))-1). Then, the response v to err is simply the convolution of err with this unit_res:
set.seed(123) ## for reproducibility
q <- 0.5
err <- rnorm(156,0,0.0001)
unit_res <- q^(seq_len(length(err))-1)
## the first value (the initial value) is zero, and we take the first 156 values from the convolution
v <- c(0, convolve(err,rev(unit_res),type="open")[1:156])
##head(v,20)
## [1] 0.000000e+00 -5.604756e-05 -5.104153e-05 1.303501e-04 7.222587e-05 4.904171e-05
## [7] 1.960274e-04 1.441053e-04 -5.445347e-05 -9.591202e-05 -9.252221e-05 7.614708e-05
##[13] 7.405492e-05 7.710461e-05 4.962057e-05 -3.077383e-05 1.633044e-04 1.314372e-04
##[19] -1.309431e-04 4.664044e-06
Since 156 is not a large number, another way to do this is to construct a unit response matrix for the difference equation v(t)=Q*v(t-1) + err(t) of the form:
Z = [1 0 0 0 ...
Q 1 0 0 ...
Q^2 Q 1 0 ...
Q^3 Q^2 Q 1 ...
... ... ... ... ...]
This matrix will be 156 x 156 in your case. Note that each column of this matrix is the response in time to a unit input in err at time t equaling to the column index. Since the system is linear, the response v to err=rnorm(156,0,0.001) is given by superposition of each individual unit response and can be computed by matrix multiplication v = Z %*% err.
To construct this matrix, we can use the function:
constructZ <- function(Q, N) {
r <- Q^(seq_len(N)-1)
m <- matrix(rep(r,N),nrow=N)
z <- matrix(0,nrow=N,ncol=N)
z[lower.tri(z,diag=TRUE)] <- m[row(m) <= (N+1-col(m))]
z
}
With this we have:
v <- c(0,constructZ(q, length(err)) %*% err)
which gives the same result.
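Since the question specifically asked about a for loop, here is a minimal loop-based sketch for comparison, using the same small example and seed as above; it reproduces the filter() result:
set.seed(0)
Q <- 0.5
e <- rnorm(5, 0, 0.1)
v <- numeric(length(e))
v[1] <- e[1]                   # first value is just the first error term
for (i in 2:length(e)) {
  v[i] <- Q * v[i - 1] + e[i]  # v[t] = Q * v[t-1] + e[t]
}
v
# [1] 0.12629543 0.03052438 0.14824212 0.20136399 0.14214614
all.equal(v, as.numeric(filter(c(0, e), filter = Q, method = "recursive")[-1]))
# [1] TRUE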

nonlinear optimization in R

I tried to minimize the following function:
func <- function(qq){
x <- qq[1]
y <- qq[2]
output <- 1 - 2 * x + x^2 - 2 * y + 2 * x * y + y^2
return(output)
}
subject to x + y = 1 and 0 <= x, y <= 1. To use gosolnp from the Rsolnp package, I first defined cons to pass as the eqfun argument:
cons <- function(qq)
sum(qq)
Then I applied gosolnp function:
install.packages("Rsolnp")
require(Rsolnp)
res <- gosolnp(fun = func, LB = c(0, 0), UB = c(1, 1), eqfun = cons, eqB = 1)
res$pars
[1] 0.8028775 0.1971225
res$value
[1] 2.606528e-09 -5.551115e-17
The answer should be x = 0 and y = 1, but as you can verify, every run of gosolnp gives new points at which func is approximately (not exactly) 0.
Mathematica and Maple optimize this function very quickly and return the true answer x = 0 and y = 1, whereas every run in R gives a different solution, which is not correct.
I also tried other optimization functions, such as spg() (in the BB package) or DEoptim, but the problem remained unsolved.
So my questions are:
1- Is there any way to minimize func in R?
2- Is there any difference in precision between R and Mathematica, and why can Mathematica give the exact answer while R cannot?
Thank you in advance
If you have two variables x and y, with y = 1 - x, then you really have a problem in just one variable x. Noting that, you can reparametrise your function to be
1 - 2 * x + x^2 - 2 * (1 - x) + 2 * x * (1 - x) + (1 - x)^2
and going through the algebra shows that this is constant as a function of x. Thus any value of x in (0, 1) is a solution, and which one your algorithm converges to will basically be random: based on numerical roundoff and your choice of starting point.
The fact that gosolnp's returned value is zero to within the limits of numerical precision should have been a tipoff, or even just plotting the curve.
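A quick numerical check of this, using the func defined in the question, is a short sketch:
## evaluate func along the constraint y = 1 - x
xs <- seq(0, 1, by = 0.1)
vals <- sapply(xs, function(xi) func(c(xi, 1 - xi)))
range(vals)   # essentially zero everywhere, up to floating-point rounding
Indeed, the objective simplifies to (1 - x - y)^2, which is identically zero on the constraint x + y = 1.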
I can't speak to these particular packages, but nloptr(...) in package nloptr seems to work well:
# Non-Linear Optimization (package::nloptr)
F <- function(v){
x=v[1]
y=v[2]
output <- 1 - 2 * x + x^2 - 2 * y + 2 * x * y + y^2
}
Hc <- function(v) return(1-sum(v))
library(nloptr)
opt <- nloptr(x0=c(1/2,1/2), eval_f=F, lb = c(0,0), ub = c(1,1),
eval_g_eq = Hc,
opts = list(algorithm="NLOPT_GN_ISRES",maxeval=1e6))
opt$solution
# [1] 0.0005506997 0.9994492982
Your function is identically equal to 0 so there is no point in trying to minimize it.
library(Ryacas)
x <- Sym("x")
y <- 1-x
Simplify( 1 - 2 * x + x^2 - 2 * y + 2 * x * y + y^2)
which gives:
expression(0)

Explain the quantile() function in R

I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m, that's straightforward: m depends on which of the 9 algorithms was chosen. So just as Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0; for 3, m is -1/2; for the others, see the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). The algorithms differ only in their choice of the pk.
As for the last bit, R is just stating what S uses.
The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one of them. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles and is what's described in Wikipedia.
BTW, in response to dreeves' answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's MathWorld's Quantile page, which states Mathematica can't do #2 but gives a simpler generalization of all the other algorithms in terms of four parameters.)
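To make the type 7 recipe concrete, here is a small worked sketch (the data vector and p below are arbitrary example values) that matches quantile(). For type 7, alpha = beta = 1, so m = 1 - p and j = floor(n*p + m) = floor((n - 1)*p + 1):
x <- sort(c(10, 1, 7, 3, 4))   # the order statistics x[1] <= ... <= x[n]
n <- length(x)
p <- 0.6
g <- (n - 1) * p + 1           # n*p + m with m = 1 - p
j <- floor(g)                  # index of the lower order statistic
gamma <- g - j                 # fractional part
(1 - gamma) * x[j] + gamma * x[j + 1]
# [1] 5.2
quantile(x, probs = p, type = 7, names = FALSE)
# [1] 5.2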
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", etc. Each of these methods has different properties that are better suited for one problem or another.
The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
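For instance, here is a small sketch of the 1:5 / 1:6 comparison suggested above, using a few of the types at p = 0.25:
sapply(c(1, 4, 7), function(tp) quantile(1:5, probs = 0.25, type = tp, names = FALSE))
# [1] 2.00 1.25 2.00
sapply(c(1, 4, 7), function(tp) quantile(1:6, probs = 0.25, type = tp, names = FALSE))
# [1] 2.00 1.50 2.25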
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R so I actually just grabbed it from a github repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code and also I added a few lines to print some important values as it goes.
quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, ...){
if(is.factor(x)) { #worry about non-numeric data
if(!is.ordered(x) || ! type %in% c(1L, 3L))
stop("factors are not allowed")
lx <- levels(x)
} else lx <- NULL
if (na.rm){
x <- x[!is.na(x)]
} else if (anyNA(x)){
stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
}
eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
stop("'probs' outside [0,1]")
#####################################
# here is where terms really used in default type==7 situation get defined
n <- length(x) #how many observations are in sample?
if(na.p <- any(!p.ok)) { # set aside NA & NaN
o.pr <- probs
probs <- probs[p.ok]
probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
}
np <- length(probs) #how many quantiles are you computing?
if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
if(type == 7) { # be completely back-compatible
index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
lo <- floor(index) #this is the observed order statistic just below each quantile
hi <- ceiling(index) #above
x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort,
#and it only guarantees that sorting is "right" at these order statistics, important for large vectors
#ties are not broken and tied elements just stay in their original order
qs <- x[lo] #the values associated with the "floor" order statistics
i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value
#this is the difference between the order statistic and the available ranks, i think
h <- (index - lo)[i] # > 0 by construction
## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
# h is the arithmetic difference between the desired order statistic and the available ranks
#interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd.
# This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
###################################
# print all of these things
cat("floor pos=", c(lo))
cat("\nceiling pos=", c(hi))
cat("\nfloor values= ", c(x[lo]))
cat( "\nwhich floors not targets? ", c(i))
cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
cat( "\nadjustment values= ", c(h))
cat("\nquantile estimates:")
}else if (type <= 3){## Types 1, 2 and 3 are discontinuous sample qs.
nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
} else {n * probs} # m = 0
j <- floor(nppm)
h <- switch(type,
(nppm > j), # type 1
((nppm > j) + 1)/2, # type 2
(nppm != j) | ((j %% 2L) == 1L)) # type 3
} else{
## Types 4 through 9 are continuous sample qs.
switch(type - 3,
{a <- 0; b <- 1}, # type 4
a <- b <- 0.5, # type 5
a <- b <- 0, # type 6
a <- b <- 1, # type 7 (unused here)
a <- b <- 1 / 3, # type 8
a <- b <- 3 / 8) # type 9
## need to watch for rounding errors here
fuzz <- 4 * .Machine$double.eps
nppm <- a + probs * (n + 1 - a - b) # n*probs + m
j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
h <- nppm - j
if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
x <- sort(x, partial =
unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
)
x <- c(x[1L], x[1L], x, x[n], x[n])
## h can be zero or one (types 1 to 3), and infinities matter
#### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
## also h*x might be invalid ... e.g. Dates and ordered factors
qs <- x[j+2L]
qs[h == 1] <- x[j+3L][h == 1]
other <- (0 < h) & (h < 1)
if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
}
} else {
qs <- rep(NA_real_, np)}
if(is.character(lx)){
qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
if(names && np > 0L) {
names(qs) <- format_perc(probs) # format_perc is an unexported helper; use stats:::format_perc if running this copy standalone
}
if(na.p) { # do this more elegantly (?!)
o.pr[p.ok] <- qs
names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
names(o.pr)[p.ok] <- names(qs)
o.pr
} else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#.75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))
