Consider the following ODE with boundary condition
y'[x] = (a - 1)*y[x]/x
y[1] = a*b
where a > 0, b > 0, and x > 0.
I want to solve this ODE numerically in R using the command ode from the R package deSolve.
That is, I want to calculate the solution y[x] for x > 0.
Questions: (i) I do not know how to include "x" in the denominator in the model equation. (ii) I also don't know how to specify the initial condition.
Disclaimer: I know this equation has an analytical solution, but I am trying to compare it with the numerical solution.
So far, I have unsuccessfully tried:
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y*(a - 1)/x   # HOW TO SPECIFY x?
# result
return( list(dy) )
}
params <- c(a = 2, b = 1)
init <- c(y = params[1]*params[2]) # HOW TO SPECIFY THE INITIAL CONDITION?
times <- seq(0,5, by = 0.1)
out <- ode(init, times, difeq, params)
plot(out)
Looking at the equation, I would say that x is equivalent to time t. Package deSolve calls the independent variable time because time-dependent systems are the most common case, but it is not limited to them; the independent variable can just as well be something else, e.g. a spatial coordinate. Just set x = t and it works exactly the same.
Note also that x = 0 must be avoided, because x appears in the denominator.
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y * (a - 1) / x
# result
return(list(y=dy))
}
params <- c(a = 2, b = 1)
init <- c(y = unname(params[1]*params[2]))
times <- seq(1, 5, by = 0.1)
out <- ode(init, times, difeq, params)
## just rename "time" to "x"
colnames(out)[1] <- "x"
head(out)
     x   y
1  1.0 2.0
2  1.1 2.2
3  1.2 2.4
4  1.3 2.6
5  1.4 2.8
6  1.5 3.0
7  1.6 3.2
8  1.7 3.4
To get the initial value at x -> 0 one may use an optimizer, run the system backwards (see separate answer) or use another solver (see the CRAN Task View).
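Since the question explicitly asks for a comparison with the analytical solution, here is a quick check (my addition; the closed form y(x) = a*b*x^(a-1), i.e. y = 2x for a = 2, b = 1, follows from separation of variables with y(1) = a*b):
a <- params[["a"]]; b <- params[["b"]]
y_exact <- a * b * out[, "x"]^(a - 1)   # analytical solution at the output points
max(abs(out[, "y"] - y_exact))          # numerical error, should be close to zero
matplot(out[, "x"], cbind(out[, "y"], y_exact), type = "l",
        lty = c("solid", "dotted"), lwd = c(1, 3), xlab = "x", ylab = "y")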
Another part of the question was how to specify the initial condition, which is given at x = 1 rather than at the left end of the integration interval. As this is somewhat separate from the technical part (how to include x in the denominator), I give it as a separate answer.
The approach uses a backward simulation. As negative time steps are not supported in deSolve, a workaround can be used:
Instead of running the simulation from 1 to 0 backwards, we run it forward from -1 to 0.
Because of the pole at x = 0, we stop at a small value close to zero.
In the model, we change the sign of the derivative, and as x also appears in the model, we multiply it by the sign too.
Note: this is included explicitly to show the general approach, for didactical reasons. In the special case here, the signs will of course cancel out.
In the script below, we first start the integration from the known initial value at x = 1. Then we do the "backward" integration to estimate the initial value close to zero. Finally, we run the model with the new initial value for the whole range of x and compare the results.
library(deSolve)
difeq <- function(x, y, parms, sign = 1) {
with(as.list(parms),{
dy <- sign * y * (a - 1) / (sign * x)
return(list(y=dy))
})
}
## initial simulation starting at x = 1
params <- c(a = 2, b = 1) # "sign" is passed via ode()'s ..., not via parms, so it does not shadow the function argument inside with()
init <- c(y = params[["a"]] * params[["b"]])
times <- seq(1, 5, by = 0.1)
out1 <- ode(init, times, difeq, params)
## backwards simulation
close_to_zero <- 1e-16 # numerical precision is limited, do not decrease this further
times <- c(-1, -close_to_zero)
out0 <- ode(init, times, difeq, params, sign = -1)
## forward simulation starting close to zero
times <- c(close_to_zero, seq(0.1, 10, 0.1))
init <- c(y = out0[[nrow(out0), 2]])
cat("y[", close_to_zero,"] =", init, "\n")
out2 <- ode(init, times, difeq, params)#, hmax=0.01)
## comparison of the initial simulation with an extended time period
plot(out2, out1, lwd=c(1, 5), lty=c("solid", "dotted"), xlab="x")
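As an additional sanity check (my addition), the value recovered near zero can be compared directly with the same closed form y(x) = a*b*x^(a-1), which for a = 2, b = 1 gives y = 2x:
a <- params[["a"]]; b <- params[["b"]]
c(recovered = unname(init), exact = a * b * close_to_zero^(a - 1))   # both approx. 2e-16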
For a simulation study, I want to generate a set of random variables (both continuous and binary) that have predefined associations to an already existing binary variable, denoted here as x.
For this post, assume that x is generated following the code below. But remember: in real life, x is an already existing variable.
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
I want to generate both a binary variable and a continuous variable. I have figured out how to generate a continuous variable (see code below)
set.seed(1245)
cor <- 0.8 #Correlation
y <- rnorm(1000, cor*x, sqrt(1-cor^2))
But I can't find a way to generate a binary variable that is correlated to the already existing variable x. I found several R packages, such as copula which can generate random variables with a given dependency structure. However, they do not provide a possibility to generate variables with a set dependency on an already existing variable.
Does anyone know how to do this in an efficient way?
Thanks!
If we look at the formula for correlation, rho = (E[XY] - E[X]E[Y]) / (sd(X) * sd(Y)):
For the new vector y, if we preserve the mean, the problem is easier to solve. That means we copy the vector x and flip an equal number of 1s and 0s to achieve the intended correlation value.
If we let E(X) = E(Y) = x_bar and E(XY) = xy_bar (for a 0/1 variable the variance is x_bar - x_bar^2), then for a given rho the above simplifies to:
(xy_bar - x_bar^2) / (x_bar - x_bar^2) = rho
Solve and we get:
xy_bar = rho * x_bar + (1-rho)*x_bar^2
And we can derive a function to flip a number of 1s and 0s to get the result:
create_vector = function(x, rho){
  n = length(x)
  x_bar = mean(x)
  xy_bar = rho * x_bar + (1 - rho) * x_bar^2   # target E(XY)
  toflip = sum(x == 1) - round(n * xy_bar)     # number of (x = 1, y = 1) pairs to break
  y = x
  y[sample(which(x == 0), toflip)] = 1         # flip that many 0s to 1s ...
  y[sample(which(x == 1), toflip)] = 0         # ... and the same number of 1s to 0s, preserving mean(y)
  return(y)
}
For your example it works:
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
cor(x,create_vector(x,0.8))
[1] 0.7986037
There are some extreme combinations of intended rho and p where you might run into problems, for example:
set.seed(111)
res = lapply(1:1000,function(i){
this_rho = runif(1)
this_p = runif(1)
x = rbinom(1000,1,this_p)
data.frame(
intended_rho = this_rho,
p = this_p,
resulting_cor = cor(x,create_vector(x,this_rho))
)
})
res = do.call(rbind,res)
library(ggplot2)
ggplot(res, aes(x = intended_rho, y = resulting_cor, col = p)) + geom_point()
Here's a binomial one - the formula for q only depends on the mean of x and the correlation you desire.
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- 1/((1-p)/cor^2+p)
y <- rbinom(100000, 1, q)
z <- x*y
cor(x,z)
#> [1] 0.7984781
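For reference (my addition, not part of the original answer): with x ~ Bernoulli(p), y ~ Bernoulli(q) independent of x and z = x*y, the correlation works out to cor(x, z) = sqrt(q*(1-p)/(1-p*q)); solving this for q gives the formula used above. A standalone numeric check:
p <- 0.6; cor_target <- 0.8
q <- 1/((1 - p)/cor_target^2 + p)   # the formula from above
sqrt(q * (1 - p) / (1 - p * q))     # recovers cor_target = 0.8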
This is not the only way to do this - note that mean(z) is always less than mean(x) in this construction.
The continuous variable is even less well defined - do you really not care about its mean/variance, or anything else about its distribution?
Here's another simple version where it flips the variable both ways:
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- (1+cor/sqrt(1-(2*p-1)^2*(1-cor^2)))/2
y <- rbinom(100000, 1, q)
z <- x*y+(1-x)*(1-y)
cor(x,z)
#> [1] 0.8001219
mean(z)
#> [1] 0.57908
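Similarly (my addition): for this construction z = x*y + (1-x)*(1-y), i.e. z = 1 exactly when x and y agree, the closed-form correlation is sqrt(p*(1-p))*(2*q-1)/sqrt(m*(1-m)) with m = p*q + (1-p)*(1-q), and the q above is chosen so that this equals the target:
p <- 0.6; cor_target <- 0.8
q <- (1 + cor_target/sqrt(1 - (2*p - 1)^2 * (1 - cor_target^2)))/2
m <- p*q + (1 - p)*(1 - q)                      # expected mean of z
sqrt(p*(1 - p)) * (2*q - 1) / sqrt(m*(1 - m))   # equals cor_target (0.8)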
I'm building predator-prey models based on Lotka-Volterra derivatives in R using the package deSolve. I define the parameters, the initial state and the timesteps, and the model function. Then I solve everything using ode(), or dede() when using a time lag.
I noticed there can be a big difference in output depending on how you refer to the inputs WITHIN the model function, and I really don't understand why. You can extract the parameters either via the argument, parms['r'], or via the previously defined object I passed to that argument, parameters['r']. Same result in both cases.
This is different for the initial state though: calling the argument, y[1] or y['N'], gives a totally different result than calling it via the object passed to the argument, init[1] or init['N'].
Also in the DDE: there's a difference between time - tau and times - tau, and between ylag <- y and ylag <- init.
Why is there a different result for argument vs object for the initial state and time and not for the parameters? I need to comprehend this well in order to use the FME package in a later stage, so I hope someone can explain this behaviour.
My code:
library(deSolve)
## Parameters
parameters <- c(r = 0.25, K = 200, a = 0.01, c = 0.01, m = 1, tau = 7)
init <- c(N = 20, P = 2)
time <- seq(0, 100, by = 0.01)
## Ordinary DE
PreyPred <- function(times, y, parms){ #chose same argument names as ode()
N <- y['N'] #y[1] works as well
P <- y['P']
#N <- init['N'] #(or init[1]) gives a totally different result!
#P <- init['P']
r <- parms['r'] #growth rate prey parameters['r'] gives same result
K <- parms['K'] #carrying capacity prey
a <- parms['a'] #attack rate predator
c <- parms['c'] #assimilation rate (?) predator
m <- parms['m'] #mortality predator
dN <- r * N * (1-N/K) - a * N * P
dP <- c * N * P - m * P
return(list(c(dN, dP)))
}
oderesult <- ode(func = PreyPred, parms = parameters, y = init, times = time)
plot(oderesult, lwd = 2, mfrow = c(1,2))
## Delayed DE
PreyPredLag <- function(times, y, parms){
N <- y['N']
P <- y['P']
#N <- init['N']
#P <- init['P']
r <- parms['r'] #growth rate prey
K <- parms['K'] #carrying capacity prey
a <- parms['a'] #attack rate predator
c <- parms['c'] #assimilation rate (?) predator
m <- parms['m'] #mortality predator
tau <- parms['tau'] #time lag
tlag <- times - tau
#tlag <- time - tau #different result
if (tlag < 0)
ylag <- y
#ylag <- init
else
ylag <- lagvalue(tlag)
# dede
dN <- r * N * (1-N/K) - a * N * P
dP <- c * ylag[1] * ylag[2] - m * P
return(list(c(dN, dP), lag = ylag))
}
dederesult <- dede(func = PreyPredLag, parms = parameters, y = init, times = time)
plot(dederesult, lwd = 2, mfrow = c(2,2))
The observed behavior is correct. A short explanation:
'parms' is the local variable in the model function, while 'parameters' is the global variable in the workspace. This is nothing specific to deSolve; it is simply how R works. In most cases the preferred way is to use the local variable.
For the states, this is indeed different. Here the outer object 'init' is the initial value at the beginning, while the local 'y' is the current value for the time step.
The dede case is analogous: init is the start value and y the instantaneous (current) value, while the argument times is the actual time step and the global time is the vector of all output time steps.
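A minimal standalone illustration of that last point (my sketch, not from the original answer): printing both objects from inside a toy model function shows that y changes from step to step while init does not.
library(deSolve)
show_state <- function(t, y, parms) {
  # 'y' is the current state handed over by the solver at time t;
  # 'init' is found in the workspace and never changes during the run
  cat("t =", t, " y =", round(y, 3), " init =", init, "\n")
  list(-0.1 * y)   # simple exponential decay
}
init <- c(N = 100)
out  <- ode(y = init, times = 0:3, func = show_state, parms = NULL, method = "euler")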
I'm playing around with the apply() family of functions in R, and was trying to write a log-likelihood function using apply().
Here's the log-likelihood for a linear regression model assuming gaussian disturbances:
# Likelihood function for the standard linear regression model
logL <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
y <- data[, 1]
x <- data[, -1]
N <- nrow(data)
# This is the contribution to the log-likelihood of individual i. Initialized at 0.
contrib <- 0
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
for (i in 1:N){
contrib <- contrib + (y[i] - beta%*%x[i,])**2
}
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
And below we simulate some data and minimize the negative log-likelihood (which is equivalent to maximising the log-likelihood).
# Simulate some data
N <- 1000
x <- cbind(1, rnorm(N,0,sd=1), rnorm(N, 0, sd=2))
true_theta <- c(2, 3, 2, 4)
y <- true_theta[1:3]%*%t(x) + rnorm(N, mean = 0, sd = true_theta[4])
my_data <- cbind(t(y),x)
optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))
So far so good: we get the same results as those used to simulate the data. Using the rbenchmark package, I find that 10 replications of the optimization step take around 4 seconds on my computer.
library(rbenchmark)
benchmark(optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
replications=10)
Now I tried replacing the for-loop with the apply function. For this, I defined contrib to be a function:
contrib <- function(beta, one_obs){
y <- one_obs[1]
x <- one_obs[-1]
return((y - beta%*%x)**2)
}
And the new log-likelihood function:
logL2 <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*sum(apply(data, FUN=contrib, beta = beta, 1))) -
  1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
This is almost twice as long. Now, I may have misunderstood the role of the apply family of functions, as they should be used for code clarity rather than for performance. However, they shouldn't be slower than a for loop, right? So what is happening with my code? Is some type conversion going on? I checked and logL returns a matrix and logL2 returns a numeric. I tried using vapply() as it allows to specify the type of the object returned, but vapply() seems to convert my data matrix into a vector by stacking every column on top of each other. This causes the contrib function not to work anymore:
logL2 <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*sum(vapply(data, FUN=contrib, beta = beta, FUN.VALUE = matrix(1)))) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
This is what I get then:
class(logL2(theta = c(1,2,2,2), my_data))
Error in beta %*% x : non-conformable arguments
So how could I use the apply family of functions to make my code more readable, and at least as fast as with a for loop?
You can simplify your code by thinking about the maths involved in your for loop.
Your for loop is
contrib <- contrib + (y[i] - beta%*%x[i,])**2
Now this is the same as just calculating all the (y[i] - beta %*% x[i, ])^2 and summing them all. Thinking about beta %*% x[i, ] you are doing matrix multiplication of a 1x3 matrix (beta) with a 3x1 (x[i, ]), giving a 1x1 result. So what you are doing is matrix-multiplying beta by each row of x independently.
However, with matrix multiplication you can do them all simultaneously anyway, and get a 1xN matrix out!
i.e. beta (1x3) %*% x (3xN) would give you a 1xN matrix, and then subtract this from y which is also a vector of length N, square each difference independently and sum them. This is equivalent to your for loop.
The only catch is that your x is Nx3 not 3xN, so we t() it first:
contrib <- sum((y - beta %*% t(x))^2)
This does away with your for loop entirely.
logL2 <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
contrib <- sum((y - beta %*% t(x))^2)
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
library(rbenchmark)
benchmark(
orig={orig.answer <- optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))},
new={new.answer <- optim(c(1,1,1, 1), fn = logL2, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))},
replications=10
)
which yields
test replications elapsed relative user.self sys.self user.child sys.child
2 new 10 0.306 1.00 0.332 0.048 0 0
1 orig 10 4.584 14.98 4.588 0.000 0 0
and also let's just check we didn't make a mistake
all.equal(orig.answer, new.answer)
# [1] TRUE
As a style point, why not have y being a third argument to logL2 (rather than cbinding it to data at the start, and then having to select the appropriate row/columns all the time)? This saves you from doing the y <- data[, 1] and x <- data[, -1] all the time. I.e. do something like logL <- function (theta, x, y) { ... } and then in your optim() call you can provide the x and y arguments rather than my_data. You might even get a further improvement by doing t(x) at the very start (e.g. in your call to optim) so it doesn't have to be done every time logL2 is called?
logL3 <- function(theta, x, y){
N <- length(y)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
contrib <- sum((y - beta %*% x)^2)
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
benchmark(
new=optim(c(1,1,1, 1), fn = logL2, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
new.new=optim(c(1,1,1, 1), fn = logL3, x=t(x), y=y,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
replications=100
)
test replications elapsed relative user.self sys.self user.child sys.child
1 new 100 3.149 2.006 3.317 0.700 0 0
2 new.new 100 1.570 1.000 1.488 0.344 0 0
It's about twice as fast. In general, if you can do something once rather than every time logL2 is called (e.g. t(x), data[, 1] etc) it'll save you some small amount of time.
With respect to your original question, however (specifically to do with the *apply functions):
vapply iterates over the elements of a vector or list; your data is a matrix, which it treats as a plain vector of its elements, so contrib is operating on one element of data at a time. I.e. contrib sees x as a single number. Hence the non-conformable matrices: your matrix multiplication is multiplying beta (a 1x3) with x (a 1x1), and for matrix multiplication to work, the number of columns of beta must equal the number of rows of x. To use vapply you'd need something like
vapply(1:nrow(data), function(i) contrib(beta, data[i, ]), FUN.VALUE=1)
(Disclaimer: I have not tested these statements by benchmarking; this is just what I have found in my experience.) Of all the *apply functions, I find that apply() is slow (often slower than the for-loop). It is handy for neatness of code ("do this for every row" or "do this for every column" type tasks: instead of lots of data[i, ], it's just apply(..., MARGIN = 1)), but if you need speed, write a for loop or use one of the other cousins like vapply, lapply or sapply.
vapply and lapply are fast. sapply is too, but in general one of the former two is faster. sapply is easier to use because the FUN.VALUE part of vapply is worked out for you; if the return value is not always the same shape, sapply is equivalent to lapply, so you may as well use that directly. Since sapply does this working out for you it can be easier to use, but it is minutely slower.
Fastest of all is if you can use some maths to avoid a loop, e.g. if you can rephrase your loop in terms of a matrix multiplication as I did here. Though this only applies to a fairly small number of situations.
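To make this concrete, here is a rough benchmark sketch (my addition, reusing contrib and my_data from the question, with an arbitrary beta = c(2, 3, 2)); exact timings will vary, but it illustrates the for-loop vs *apply vs vectorised comparison discussed above:
library(rbenchmark)
beta <- c(2, 3, 2)
benchmark(
  forloop    = {s <- 0; for (i in 1:nrow(my_data)) s <- s + contrib(beta, my_data[i, ])},
  vapply     = sum(vapply(1:nrow(my_data), function(i) contrib(beta, my_data[i, ]), FUN.VALUE = 1)),
  vectorised = sum((my_data[, 1] - beta %*% t(my_data[, -1]))^2),
  replications = 100
)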
So I have a system of ODEs and some data, and I am using the R packages deSolve and FME to fit the parameters of the ODE system to the data. I get a singular matrix result when I fit the full parameter set. So I went back and looked at the collinearity of the parameters, using a collinearity index cut-off of 20 as suggested in the FME package documentation, and then picked a few models with subsets of parameters to fit. When I run modFit I now get this error:
Error in approx(xMod, yMod, xout = xDat) :
need at least two non-NA values to interpolate
Can anyone enlighten me as to a fix for this? Everything else is working fine, so this is not a coding problem.
Here is a minimal working example (removing r = 2 in modFit creates the error; I can fix it in the minimal working example but not in my actual problem, so I doubt the minimal working example helps much here):
## =======================================================================
## Now suppose we do not know K and r and they are to be fitted...
## The "observations" are the analytical solution
## =======================================================================
# You need these packages
library('deSolve')
library('FME')
## logistic growth model
TT <- seq(1, 100, 2.5)
N0 <- 0.1
r <- 0.5
K <- 100
## analytical solution
Ana <- cbind(time = TT, N = K/(1 + (K/N0 - 1) * exp(-r*TT)))
time <- 0:100
parms <- c(r = r, K = K)
x <- c(N = N0)
logist <- function(t, x, parms) {
with(as.list(parms), {
dx <- r * x[1] * (1 - x[1]/K)
list(dx)
})
}
## Run the model with initial guess: K = 10, r = 2
parms["K"] <- 10
parms["r"] <- 2
init <- ode(x, time, logist, parms)
## FITTING algorithm uses modFit
## First define the objective function (model cost) to be minimised
## more general: using modFit
Cost <- function(P) {
parms["K"] <- P[1]
parms["r"] <- P[2]
out <- ode(x, time, logist, parms)
return(modCost(out, Ana))
}
(Fit<-modFit(p = c(K = 10,r=2), f = Cost))
summary(Fit)
I think the problem is in your Cost function. If you don't provide both K and r, then the cost function will override the start value of r with NA (P[2] does not exist, so parms["r"] <- P[2] assigns NA). You can test this:
Cost <- function(P) {
parms["K"] <- P[1]
parms["r"] <- P[2]
print(parms)
#out <- ode(x, time, logist, parms)
#return(modCost(out, Ana))
}
Cost(c(K=10, r = 2))
Cost(c(K=10))
This function works:
Cost <- function(P) {
parms[names(P)] <- P
out <- ode(x, time, logist, parms)
return(modCost(out, Ana))
}
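With that version you can also fit just a subset of the parameters; anything not listed in p keeps its current value in parms. For example (my addition, assuming the objects from the question are defined):
## fit K only; r stays at whatever value parms["r"] currently holds
FitK <- modFit(p = c(K = 10), f = Cost)
summary(FitK)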
The vignette FMEDyna is very helpful: https://cran.r-project.org/web/packages/FME/vignettes/FMEdyna.pdf See page 14 on how to specify the Objective (Cost) function.
I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular I'm confused by the notation of p being both a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361-365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m that's straightforward. m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0, for 3, m is -1/2, and for the others, that's in the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). The algorithms differ only in how the pk are defined.
As for the last bit, R is just stating what S uses.
The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles, and is what's described in Wikipedia.
BTW, in response to dreeves' answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's MathWorld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
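To tie the notation together, here is a small worked example (my addition) that computes a type-7 quantile by hand, using j, m and gamma as defined above, and checks it against quantile():
x <- c(10, 20, 30, 40, 50)
p <- 0.35
n <- length(x)
alpha <- beta <- 1                    # for type 7
m <- alpha + p * (1 - alpha - beta)   # = 1 - p for type 7
j <- floor(n * p + m)                 # index of the lower order statistic
gamma <- n * p + m - j                # the fractional part g
xs <- sort(x)
(1 - gamma) * xs[j] + gamma * xs[j + 1]   # 24
quantile(x, probs = p, type = 7)          # 35%: 24 -- the same value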
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", etc. Each of these methods has different properties that are better suited for one problem or another.
The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
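Following that suggestion, here is a quick way to see all nine types side by side (my addition):
## the 0.25 quantile of 1:5 and 1:6 under each of the nine types
sapply(1:9, function(type) quantile(1:5, probs = 0.25, type = type, names = FALSE))
sapply(1:9, function(type) quantile(1:6, probs = 0.25, type = type, names = FALSE))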
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R, so I just grabbed it from a GitHub repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code itself and in the few lines I added to print some important values as it goes.
quantile.default <-function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE
, type = 7, ...){
if(is.factor(x)) { #worry about non-numeric data
if(!is.ordered(x) || ! type %in% c(1L, 3L))
stop("factors are not allowed")
lx <- levels(x)
} else lx <- NULL
if (na.rm){
x <- x[!is.na(x)]
} else if (anyNA(x)){
stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
}
eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
stop("'probs' outside [0,1]")
#####################################
# here is where terms really used in default type==7 situation get defined
n <- length(x) #how many observations are in sample?
if(na.p <- any(!p.ok)) { # set aside NA & NaN
o.pr <- probs
probs <- probs[p.ok]
probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
}
np <- length(probs) #how many quantiles are you computing?
if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
if(type == 7) { # be completely back-compatible
index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
lo <- floor(index) #this is the observed order statistic just below each quantile
hi <- ceiling(index) #above
x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort,
#and it only guarantees that sorting is "right" at these order statistics, important for large vectors
#ties are not broken and tied elements just stay in their original order
qs <- x[lo] #the values associated with the "floor" order statistics
i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value
#this is the difference between the order statistic and the available ranks, i think
h <- (index - lo)[i] # > 0 by construction
## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
# h is the arithmetic difference between the desired order statistic and the available ranks
#interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd.
# This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
###################################
# print all of these things
cat("floor pos=", c(lo))
cat("\nceiling pos=", c(hi))
cat("\nfloor values= ", c(x[lo]))
cat( "\nwhich floors not targets? ", c(i))
cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
cat( "\nadjustment values= ", c(h))
cat("\nquantile estimates:")
}else if (type <= 3){## Types 1, 2 and 3 are discontinuous sample qs.
nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
} else {n * probs} # m = 0
j <- floor(nppm)
h <- switch(type,
(nppm > j), # type 1
((nppm > j) + 1)/2, # type 2
(nppm != j) | ((j %% 2L) == 1L)) # type 3
} else{
## Types 4 through 9 are continuous sample qs.
switch(type - 3,
{a <- 0; b <- 1}, # type 4
a <- b <- 0.5, # type 5
a <- b <- 0, # type 6
a <- b <- 1, # type 7 (unused here)
a <- b <- 1 / 3, # type 8
a <- b <- 3 / 8) # type 9
## need to watch for rounding errors here
fuzz <- 4 * .Machine$double.eps
nppm <- a + probs * (n + 1 - a - b) # n*probs + m
j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
h <- nppm - j
if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
x <- sort(x, partial =
unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
)
x <- c(x[1L], x[1L], x, x[n], x[n])
## h can be zero or one (types 1 to 3), and infinities matter
#### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
## also h*x might be invalid ... e.g. Dates and ordered factors
qs <- x[j+2L]
qs[h == 1] <- x[j+3L][h == 1]
other <- (0 < h) & (h < 1)
if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
}
} else {
qs <- rep(NA_real_, np)}
if(is.character(lx)){
qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
if(names && np > 0L) {
names(qs) <- format_perc(probs)  # format_perc() is internal to stats; use stats:::format_perc() if running this function outside the package
}
if(na.p) { # do this more elegantly (?!)
o.pr[p.ok] <- qs
names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
names(o.pr)[p.ok] <- names(qs)
o.pr
} else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#.75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))