How to simulate a Polya urn (Martingale)-like problem in R?

In a village live N = 100 people, and they decide on proposed actions in an interesting way. Specifically, if someone proposes an action, then all N villagers vote YES or NO on it. The next day each villager re-adjusts her/his opinion independently of the other villagers, and votes again with probability equal to the proportion of the total (maximum) supporters on the previous day. This voting process continues until all N = 100 agree on the same opinion.
Question:
How many voting days must pass until all N villagers vote the same ?
My effort
The answer must be the number of iterations needed.
I want to simulate this process in R. It is (I believe) a Polya-urn-like process, but here we don't start with red = 1 and green = 1 ball in the urn; we have N balls (voters).
Also, on the first day X_0, we have a random number i of people voting YES and j people voting NO.
Therefore we have p = i/N and q = j/N.
Now the next day each villager will vote again, but with probability equal to the maximum probability of the previous day.
Something like
votevillage <- function(n) {
  i = sample(1:N, 1)
  j = N - i
  p = i/N
  q = 1 - p
  support = max(i, j)
  while (support != n) {
    vote = sample(c("YES", "NO"), 1, prob = c(1 - p, p))
    support = support + vote
  }
  if (vote == "YES")
    return(1)
  else
    return(0)
}

n = 100
trials = 100000
simlist = replicate(trials, votevillage(n))
mean(simlist)
The above code is wrong; it's just my idea (something like pseudocode).

As mentioned in the comments, it depends of course on the distribution of yes voters in the first round (if all villagers voted yes (or no) on the first round, the whole election lasts only one day).
The following lines show how to simulate the voting:
nr_of_yes_votes <- function(prob, N) {
  rbinom(1, N, prob)
}

nr_of_days_until_unanimity <- function(x0, N) {
  i <- 1
  x <- x0
  while (x < N && x > 0) {
    p <- x / N
    x <- nr_of_yes_votes(p, N)
    i <- i + 1
  }
  i
}

simulate <- function(prob0, N = 100, seed = 123, reps = 10000) {
  set.seed(seed)
  x0 <- nr_of_yes_votes(prob0, N)
  mean(replicate(reps, nr_of_days_until_unanimity(x0, N)))
}
simulate(.5) ## 137.9889
simulate(0) ## 1
simulate(1) ## 1
Intuitively, the more disagreement there is in the beginning, the longer it will take to reach unanimity. Furthermore, the problem is symmetric. Thus, we would expect the number of days to peak when there is maximum disagreement in the first voting (which corresponds to an initial voting probability of 0.5) and to decline symmetrically as we come closer to 0 (or 1).
This can be shown nicely with the following lines:
ns <- vapply((p0 <- seq(0, 1, by = .01)), simulate, numeric(1))
plot(p0, ns, type = "l", xlab = expression(Prob[0]),
ylab = "Expected Days")

Related

Manually simulating Poisson Process in R

The following problem tells us to generate a Poisson process step by step from ρ (inter-arrival time), and τ (arrival time).
One of the theoretical results presented in the lectures gives the following direct method for simulating a Poisson process:
• Let τ0 = 0.
• Generate i.i.d. exponential random variables ρ1, ρ2, ….
• Let τn = ρ1 + … + ρn for n = 1, 2, ….
• For each k = 0, 1, …, let Nt = k for τk ≤ t < τk+1.
Using this method, generate a realization of a Poisson process (Nt)t with λ = 0.5 on the interval [0, 20].
Generate 10000 realizations of a Poisson process (Nt)t with λ = 0.5 and use your results to estimate E(Nt) and Var(Nt). Compare the estimates with the theoretical values.
My attempted solution:
First, I have generated the values of ρ using rexp() function in R.
rhos <- function(lambda, max1)
{
  vec <- vector()
  for (i in 1:max1)
  {
    vec[i] <- rexp(0.5)
  }
  return(vec)
}
Then, I created the τs by progressively summing the ρs.
taos <- function(lambda, max)
{
  rho_vec <- rhos(lambda, max)
  #print(rho_vec)
  vec <- vector()
  vec[1] <- 0
  sum <- 0
  for (i in 2:max)
  {
    sum <- sum + rho_vec[i]
    vec[i] <- sum
  }
  return(vec)
}
The following function is for finding the value of Nt = k when the value of k is given (say, k = 7).
Ntk <- function(lambda, max, k)
{
  tao_vec <- taos(lambda, max)
  val <- max(tao_vec[tao_vec < k])
}
y <- taos(0.5, 20)
x <- seq(0, 20-1, by=1)
plot(x,y, type="s")
Output: as you can see, the plot of the Poisson process is blank rather than a staircase.
If I change rexp to exp, I get a staircase function, but all the steps are equal.
Why is my source code not producing the expected output?
It looks like you're using max1 to indicate how many times to sample the exponential distribution in your rhos function. I would recommend something like this:
rhosGen <- function(lambda, maxTime){
  rhos <- NULL
  i <- 1
  while(sum(rhos) < maxTime){
    samp <- rexp(n = 1, rate = lambda)
    rhos[i] <- samp
    i <- i + 1
  }
  return(head(rhos, -1))
}
This will continue to sample from the exponential distribution until the sum of these holding times is larger than the length of the given interval. head() then removes the last sample so that all of the events we keep track of definitely occur in our time interval of interest.
From here you have to generate the taos by summing the previous holding times (rhos):
taosGen <- function(lambda, maxTime){
  rhos <- rhosGen(lambda, maxTime)
  taos <- NULL
  for(i in 1:length(rhos)){
    taos[i] <- sum(rhos[1:i])
  }
  return(taos)
}
Now that you have the taos, we know at what time each event in the time interval (0, maxTime) occurs. This leads us to generating the associated Poisson process by finding the value of Nt for each t in the time interval:
ppGen <- function(lambda, maxTime){
  taos <- taosGen(lambda, maxTime)
  pp <- NULL
  for(i in 1:maxTime){
    pp[i] <- sum(taos <= i)
  }
  return(pp)
}
This generates the value of the Poisson process at each integer time in the interval. I suspect that part of your issue was trying to put the tao values on the y-axis instead of the count of events that had already occurred. The following code worked for me to produce a random-looking staircase, similar to your example.
y <- ppGen(0.5, 20)
x <- seq(0, 20-1, by=1)
plot(x,y, type="s")
Here's another possible implementation. The idea is to generate a vector of wait times (tau) and plot that against the list of events we're waiting for (n in the code below).
poi.process <- function(lambda, n){
  # initialize vector of total wait time for the arrival of each event:
  s <- numeric(n + 1)
  # set S_0 = 0
  s[1] <- 0
  # generate vector of iid Exp random variables:
  x <- replicate(n, rexp(1, lambda))
  # assign wait time to vector s in for loop:
  for (k in 1:n){
    s[k+1] <- sum(x[1:k])
  }
  # return vector of wait times
  return(s)
}
Plotting it using stepfun will get us something like this:
n <- 20
lambda <- 3
# simulate list of wait times:
s_list <- poi.process(lambda, n)
# plot function:
plot(stepfun(0:(n-1), s_list),
     do.points = TRUE,
     pch = 16,
     col.points = "red",
     verticals = FALSE,
     main = 'Realization of a Poisson process with lambda = 3',
     xlab = 'Time of arrival',
     ylab = 'Number of arrivals')
Sample Poisson process:

Dividing one number by an array of numbers in R

2520 is the smallest number that can be divided by each of the
numbers from 1 to 10 without any remainder. Use a loop to find out what is the smallest positive number that is divisible (we mean the remainder should be 0) by all of the numbers from 1 to 20?
So far for this I have the following code but it does not work:
divisors = seq(1:20)
divisors
num1 = 2520
while(TRUE){
  if (num1 %% divisors == 0){
    print(num1)
    break
  }
  num1 = num1 + 1
}
Also, I need help with computing probabilities in R.
How many draws on average do you need to have one level A prize? To do this, please generate 1000 games and use a vector to record the number of draws.
For this I have the following code:
set.seed(1)
random_games <- sample(c('A','B'), size=1000, replace=T, prob=c(0.2,0.8))
random_games
What's the probability of obtaining a level A prize within 5 draws? Calculate the theoretical value.
This one I am confused about how to calculate.
Since the while loop seems to take ages to solve the problem, we will instead exploit the recursive relationship between the gcd and the lcm:
gcd = function(x){
  w = 1:min(x)
  max(w[sapply(w, function(y) all(!x %% y))])
}
lcm = function(x){
  if(length(x) > 2) lcm(c(prod(x[1:2])/gcd(x[1:2]), x[-(1:2)]))
  else prod(x)/gcd(x)
}
lcm(c(1:20))
[1] 232792560
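As a quick sanity check on that result:
all(232792560 %% 1:20 == 0)  # TRUE: divisible by every integer from 1 to 20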
When you check the remainder for 1:20, you get 20 logical values as a result, and you have to make sure they are all TRUE. You can do that using all().
divisors = seq(1:20)
divisors
num1 = 2520
while(TRUE){
  if (all(num1 %% divisors == 0)){
    print(num1)
    break
  }
  num1 = num1 + 1
}
or do
gcd <- function(x, y) {
  while (y) {
    z = x %% y
    x = y
    y = z
  }
  return(x)
}

Reduce(function(x, y) x * y %/% gcd(x, y), 3:20)
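The prize-draw part of the question is a geometric-distribution problem: with a level A probability of 0.2 per draw (as in the sample call above), the expected number of draws until the first A is 1/0.2 = 5, and the probability of getting an A within 5 draws is 1 - 0.8^5 = 0.67232. A minimal simulation sketch, assuming those probabilities:
set.seed(1)
draws_needed <- replicate(1000, {
  n_draws <- 0
  repeat {  # keep drawing until the first level A prize
    n_draws <- n_draws + 1
    if (sample(c('A', 'B'), 1, prob = c(0.2, 0.8)) == 'A') break
  }
  n_draws
})
mean(draws_needed)       # average number of draws; close to 1 / 0.2 = 5
mean(draws_needed <= 5)  # proportion within 5 draws; close to 1 - 0.8^5 = 0.67232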

For loop in R (Special Case: Wiener Process)

I'm currently working on the valuation of option prices for my university thesis.
We need to program some things in R. It's the first time I'm working with programming software like R. I've been at it for the last two weeks, and this is how far I've got:
s <- 120    # value of the stock today
sd <- 0.1   # standard deviation
d <- 0.003  # drift
N <- 365
T <- 1
dt <- T/N
t <- seq(0, T, length = N + 1)
W <- c(0, cumsum(sqrt(dt) * rnorm(N)))
# plot(t, W, type = "l", main = "Wiener process", ylim = c(-1, 1))
S <- s * exp(d + sd * W)
S
This is a simple generalized Wiener process which I want to turn into a Monte Carlo simulation.
For S there are now 366 (N+1) values of the stock path. What I need is a for loop which takes the last value of S and puts it into a vector, so that I can run the loop, for example, 10000 times, collect every last value of S, and take the average of that vector.
I have no idea how to program such a for loop.
I would really appreciate it if you could help me or give me some good hints.
Greetings from Germany
Christian
I never studied Wiener Processes, but I think this would be a simple outline of the code you're trying to achieve:
stock_prices <- s  # initialise vector of stock prices
numIter <- 10^4    # set number of iterations in the for loop
for(i in 1:numIter) {
  s <- stock_prices[i]  # this is the current stock price (for the ith iteration / time step)
  # Calculate the next stock price here, call it next_price
  # Add the price of the next iteration / time step to your vector:
  stock_prices <- c(stock_prices, next_price)
}
stock_prices will be a vector of the 10,000 stock prices you simulated.
I don't know how you calculate the next stock price from S, but if you draw from the values of S randomly, then it might be useful to check out the function sample (type ?sample for help on it).
Hope that helps
If you just want to run code repeatedly, putting it in a function is nice (but not absolutely necessary). I will refer to all the code in your question as <your code>.
To make a function that runs your code,
my_function = function() {
  <your code>
}
The function will, by default, return its last line, in this case S. You only want the last element of S, tail(S, 1). So we can modify the function to return only that:
my_function = function() {
  <your code>
  return(tail(S, 1))
}
We can then call it in a for loop n times and assign the result. It is best to pre-allocate the vector for the results so that an appropriately sized block of memory can be set aside for it up front:
n = 10000
results = rep(NA, n)
for (i in 1:n) {
  results[i] <- my_function()
}
This is equivalent to
n = 10000
results = rep(NA, n)
for (i in 1:n) {
  <your code>
  results[i] <- tail(S, 1)
}
And, for that matter, it is also equivalent to
results = replicate(n, my_function())
which is a handy shortcut.
If you want to be fancy, you could parameterize your function:
my_nice_function = function(s = 120, sd = 0.1, d = 0.003, N = 365) {
  T <- 1
  dt <- T / N
  t <- seq(0, T, length = N + 1)
  W <- c(0, cumsum(sqrt(dt) * rnorm(N)))
  S <- s * exp(d + sd * W)
  return(tail(S, 1))
}
Now my_nice_function has default values as in your code, but you can easily adjust them, e.g., to run 50 simulations with sd = 0.2 you can do this:
replicate(50, my_nice_function(sd = 0.2))
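To get the average terminal price the original question asks for, one replicate call over my_nice_function and a mean of the results is enough, for example:
set.seed(1)
last_values <- replicate(10000, my_nice_function())  # 10000 simulated terminal stock prices
mean(last_values)                                    # their average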

Simulating coin toss

In the New York Times yesterday there was a reference to a paper essentially saying that the probability of 'heads' after a 'head' appears is not 0.5 (assuming a fair coin), challenging the "hot hand" myth. I want to prove it to myself.
Thus, I am working on coding a simulation of 7 coin tosses, and counting the number of heads after the first head, provided, naturally, that there is a first head at all.
I came up with the following lines of R code, but I'm still getting NA values, and would appreciate some help:
n <- 7            # number of tosses
p <- 0.5          # probability of heads
sims <- 100       # number of simulations
Freq_post_H <- 0  # frequency of heads after the first head
for(i in 1:sims){
  z <- rbinom(n, 1, p)
  if(sum(z == 1) != 0){
    y <- which(z == 1)[1]
    Freq_post_H[i] <- sum(z[(y+1):n]) / length((y+1):n)
  }else{
    next()
  }
  Freq_post_H
}
Freq_post_H
What am I missing?
CONCLUSION: After the initial hiccups of mismatched variable names, both responses solve the question. One of the answers corrects problems in the initial code related to what happens on the last toss by introducing min(y + 1, n), and corrects the basic misunderstanding of next within a loop, which was generating NA for the skipped iterations. So thank you (+1).
Critically, and the reason for this appended "conclusion", the second response addresses a more fundamental, conceptual problem: we want to calculate the fraction of heads that are preceded by a head, as opposed to p(H) in whatever number of tosses remain after a head has appeared, which will be 0.5 for a fair coin.
This is a simulation of what they did in the newspaper:
nsims <- 10000
k <- 4
set.seed(42)
sims <- replicate(nsims, {
  x <- sample(0:1, k, TRUE)
  #print(x)
  sum(                       # sum logical values, i.e. 0/1
    diff(x) == 0L &          # is the difference between consecutive values 0?
      x[-1] == 1L) /         # and are these values heads?
    sum(head(x, -1) == 1L)   # divide by the number of heads (without the last toss)
})
mean(sims, na.rm = TRUE)  # NaN cases are samples without heads, i.e. 0/0
#[1] 0.4054715
k <- 7
sims <- replicate(nsims, {
  x <- sample(0:1, k, TRUE)
  #print(x)
  sum(diff(x) == 0L & x[-1] == 1L) / sum(head(x, -1) == 1L)
})
mean(sims, na.rm = TRUE)
#[1] 0.4289402
n <- 7            # number of tosses
p <- 0.5          # probability of heads
sims <- 100       # number of simulations
Prob_post_H <- 0  # frequency of heads after the first head
for(i in 1:sims){
  z <- rbinom(n, 1, p)
  if(sum(z == 1) != 0){
    y <- which(z == 1)[1]
    Prob_post_H[i] <- mean(z[min(y+1, n):n], na.rm = TRUE)
  }else{
    next()
  }
}
mean(Prob_post_H, na.rm = TRUE)
#[1] 0.495068
It looks like it's right around 50%. We can scale up to see more simulations.
sims <- 10000
mean(Prob_post_H,na.rm=TRUE)
#[1] 0.5057866
Still around 50%.
This is to simulate 100 fair coin tosses 30,000 times
counter <- 1
coin <- sum(rbinom(100, 1, 0.5))
while(counter < 30000){
  coin <- c(coin, sum(rbinom(100, 1, 0.5)))
  counter <- counter + 1
}
Try these after running the code above:
hist(coin)
str(coin)
mean(coin)
sd(coin)
Below is some sample code in R to simulate a fair coin toss using the sample function. You can modify it as you like to simulate any number of flips. Since the outcome of each flip is independent, the probability of a head or a tail is always 0.5 for any given flip. Over many coin flips, the probability that at least half of the flips are heads (or tails) converges to 0.5, while the probability of getting exactly half heads and half tails approaches 0.
n <- 7
count_heads <- 0
coin_flip <- sample(c(0, 1), n, replace = TRUE)
for(flip_i in 1:n)
{
  if(coin_flip[flip_i] == 1)
  {
    count_heads = count_heads + 1
  }
}
count_heads/n
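As a quick empirical check of the last two claims above (a sketch using rbinom rather than sample): as the number of flips grows, the probability of exactly half heads shrinks towards 0, while the probability of at least half heads stays close to 0.5.
set.seed(1)
for (n_flips in c(10, 100, 1000, 10000)) {
  heads <- rbinom(100000, n_flips, 0.5)  # number of heads in n_flips fair flips, repeated 100000 times
  cat("n =", n_flips,
      " P(exactly half heads) ~", mean(heads == n_flips / 2),
      " P(at least half heads) ~", mean(heads >= n_flips / 2), "\n")
}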

Explain the quantile() function in R

I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". The American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m that's straightforward. m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0, for 3, m is -1/2, and for the others, that's in the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and the m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). Each algorithm only differs in the algorithm for the pk.
As for the last bit, R is just stating what S uses.
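To make the "points (pk, x[k]) are interpolated" description concrete, here is a small sketch (with made-up toy data) showing that the default type 7 quantiles coincide with linear interpolation through the plotting positions p(k) = (k - 1)/(n - 1):
x  <- c(10, 20, 30, 50, 80)              # toy data, already sorted
n  <- length(x)
pk <- (seq_len(n) - 1) / (n - 1)         # type 7 plotting positions
p  <- c(0.1, 0.25, 0.5, 0.9)
approx(pk, x, xout = p)$y                # linear interpolation through (pk, x[k])
quantile(x, p, type = 7, names = FALSE)  # same values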
The original paper gives a list of six "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stats geeks like myself would consider the quantiles and is what's described in Wikipedia.
BTW, in response to dreeves' answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's MathWorld's Quantile page, which states that Mathematica can't do #2 but gives a simpler generalization of all the other algorithms in terms of four parameters.)
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", etc. Each of these methods has different properties that are better suited for one problem or another.
The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
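For instance, a quick sketch along the lines of that suggestion:
# 0.25 quantile of 1:5 and 1:6 under each of the nine types
sapply(1:9, function(t) quantile(1:5, probs = 0.25, type = t, names = FALSE))
sapply(1:9, function(t) quantile(1:6, probs = 0.25, type = t, names = FALSE))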
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R so I actually just grabbed it from a github repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code and also I added a few lines to print some important values as it goes.
quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
                             names = TRUE, type = 7, ...) {
  if (is.factor(x)) { # worry about non-numeric data
    if (!is.ordered(x) || !type %in% c(1L, 3L))
      stop("factors are not allowed")
    lx <- levels(x)
  } else lx <- NULL
  if (na.rm) {
    x <- x[!is.na(x)]
  } else if (anyNA(x)) {
    stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
  }
  eps <- 100 * .Machine$double.eps # this is to deal with rounding things sensibly
  if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1 + eps)))
    stop("'probs' outside [0,1]")
  #####################################
  # here is where the terms really used in the default type == 7 situation get defined
  n <- length(x) # how many observations are in the sample?
  if (na.p <- any(!p.ok)) { # set aside NA & NaN
    o.pr <- probs
    probs <- probs[p.ok]
    probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
  }
  np <- length(probs) # how many quantiles are you computing?
  if (n > 0 && np > 0) { # have positive observations and # quantiles to compute
    if (type == 7) { # be completely back-compatible
      index <- 1 + (n - 1) * probs # this gives the order statistic of the quantiles
      lo <- floor(index)   # the observed order statistic just below each quantile
      hi <- ceiling(index) # ... and just above
      x <- sort(x, partial = unique(c(lo, hi))) # the partial sort reduces sorting time;
      # it only guarantees that sorting is "right" at these order statistics, important for large vectors;
      # ties are not broken and tied elements just stay in their original order
      qs <- x[lo] # the values associated with the "floor" order statistics
      i <- which(index > lo) # which of the quantile order statistics do not land on an order statistic for an observed value
      h <- (index - lo)[i] # > 0 by construction; this is the difference between the order statistic and the available ranks, I think
      ## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
      ## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
      qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # the interpolation step: assemble the estimated quantile
      # by removing h*low and adding back in h*high;
      # h is the arithmetic difference between the desired order statistic and the available ranks;
      # interpolation only occurs if the desired order statistic is not observed,
      # e.g. the .5 quantile is the actual observed median if n is odd.
      # This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
      ###################################
      # print all of these things
      cat("floor pos=", c(lo))
      cat("\nceiling pos=", c(hi))
      cat("\nfloor values= ", c(x[lo]))
      cat("\nwhich floors not targets? ", c(i))
      cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
      cat("\nadjustment values= ", c(h))
      cat("\nquantile estimates:")
    } else if (type <= 3) { ## Types 1, 2 and 3 are discontinuous sample qs.
      nppm <- if (type == 3) { n * probs - .5 # n * probs + m; m = -0.5
      } else { n * probs }                    # m = 0
      j <- floor(nppm)
      h <- switch(type,
                  (nppm > j),                      # type 1
                  ((nppm > j) + 1) / 2,            # type 2
                  (nppm != j) | ((j %% 2L) == 1L)) # type 3
    } else {
      ## Types 4 through 9 are continuous sample qs.
      switch(type - 3,
             {a <- 0; b <- 1}, # type 4
             a <- b <- 0.5,    # type 5
             a <- b <- 0,      # type 6
             a <- b <- 1,      # type 7 (unused here)
             a <- b <- 1 / 3,  # type 8
             a <- b <- 3 / 8)  # type 9
      ## need to watch for rounding errors here
      fuzz <- 4 * .Machine$double.eps
      nppm <- a + probs * (n + 1 - a - b) # n*probs + m
      j <- floor(nppm + fuzz)             # m = a + probs*(1 - a - b)
      h <- nppm - j
      if (any(sml <- abs(h) < fuzz)) h[sml] <- 0
      x <- sort(x, partial =
                  unique(c(1, j[j > 0L & j <= n], (j + 1)[j > 0L & j < n], n)))
      x <- c(x[1L], x[1L], x, x[n], x[n])
      ## h can be zero or one (types 1 to 3), and infinities matter
      #### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
      ## also h*x might be invalid ... e.g. Dates and ordered factors
      qs <- x[j + 2L]
      qs[h == 1] <- x[j + 3L][h == 1]
      other <- (0 < h) & (h < 1)
      if (any(other)) qs[other] <- ((1 - h) * x[j + 2L] + h * x[j + 3L])[other]
    }
  } else {
    qs <- rep(NA_real_, np)
  }
  if (is.character(lx)) {
    qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)
  }
  if (names && np > 0L) {
    names(qs) <- format_perc(probs)
  }
  if (na.p) { # do this more elegantly (?!)
    o.pr[p.ok] <- qs
    names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
    names(o.pr)[p.ok] <- names(qs)
    o.pr
  } else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x, probs = probs, names = FALSE)
quantile.default(y, probs = probs, names = FALSE) # only difference is the .975 quantile, since that is driven by the highest 2 observations
quantile.default(z, probs = probs, names = FALSE) # this shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...),
# but the .75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))

Resources