I'm playing around with the apply() family of functions in R, and was trying to write a log-likelihood function using apply().
Here's the log-likelihood for a linear regression model assuming Gaussian disturbances:
# Likelihood function for the standard linear regression model
logL <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
y <- data[, 1]
x <- data[, -1]
N <- nrow(data)
# This is the contribution to the log-likelihood of individual i. Initialized at 0.
contrib <- 0
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
for (i in 1:N){
contrib <- contrib + (y[i] - beta%*%x[i,])**2
}
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
And below we simulate some data and minimize the negative log-likelihood (which is equivalent to maximising the log-likelihood).
# Simulate some data
N <- 1000
x <- cbind(1, rnorm(N,0,sd=1), rnorm(N, 0, sd=2))
true_theta <- c(2, 3, 2, 4)
y <- true_theta[1:3]%*%t(x) + rnorm(N, mean = 0, sd = true_theta[4])
my_data <- cbind(t(y),x)
optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))
So far so good: we recover the values used to simulate the data. Using the rbenchmark package, I find that 10 replications of the optimization step take around 4 seconds on my computer.
benchmark(optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
replications=10)
Now I tried replacing the for-loop with the apply function. For this, I defined contrib to be a function:
contrib <- function(beta, one_obs){
y <- one_obs[1]
x <- one_obs[-1]
return((y - beta%*%x)**2)
}
And the new log-likelihood function:
logL2 <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*sum(apply(data, FUN=contrib, beta = beta, 1))) -
  1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
This is almost twice as slow. Now, I may have misunderstood the role of the apply family of functions, as they should be used for code clarity rather than for performance. Still, they shouldn't be slower than a for loop, right? So what is happening with my code? Is some type conversion going on? I checked, and logL returns a matrix while logL2 returns a numeric. I tried using vapply(), as it allows you to specify the type of the object returned, but vapply() seems to convert my data matrix into a vector by stacking every column on top of each other. This causes the contrib function to stop working:
logL2 <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*sum(vapply(data, FUN=contrib, beta = beta, FUN.VALUE = matrix(1)))) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
This is what I get then:
class(logL2(theta = c(1,2,2,2), my_data))
Error in beta %*% x : non-conformable arguments
So how could I use the apply family of functions to make my code more readable, and at least as fast as with a for loop?
You can simplify your code by thinking about the maths involved in your for loop.
Your for loop is
contrib <- contrib + (y[i] - beta%*%x[i,])**2
Now this is the same as calculating every term (y[i] - beta %*% x[i, ])^2 and summing them. Looking at beta %*% x[i, ], you are doing matrix multiplication of a 1x3 matrix (beta) with a 3x1 matrix (x[i, ]), giving a 1x1 result. So what you are doing is matrix-multiplying beta by each row of x independently.
However, with matrix multiplication you can do them all simultaneously anyway, and get a 1xN matrix out!
i.e. beta (1x3) %*% x (3xN) would give you a 1xN matrix; then subtract this from y (which is also a vector of length N), square each difference independently, and sum them. This is equivalent to your for loop.
The only catch is that your x is Nx3 not 3xN, so we t() it first:
contrib <- sum((y - beta %*% t(x))^2)
This does away with your for loop entirely.
logL2 <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
contrib <- sum((y - beta %*% t(x))^2)
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
library(rbenchmark)
benchmark(
orig={orig.answer <- optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))},
new={new.answer <- optim(c(1,1,1, 1), fn = logL2, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))},
replications=10
)
which yields
test replications elapsed relative user.self sys.self user.child sys.child
2 new 10 0.306 1.00 0.332 0.048 0 0
1 orig 10 4.584 14.98 4.588 0.000 0 0
and also let's just check we didn't make a mistake
all.equal(orig.answer, new.answer)
# [1] TRUE
As a style point, why not make y a third argument to logL2, rather than cbind-ing it to data at the start and then having to select the appropriate rows/columns every time? That saves you the y <- data[, 1] and x <- data[, -1] steps, i.e. do something like logL <- function(theta, x, y) { ... } and then supply the x and y arguments in your optim() call instead of my_data. You might even get a further improvement by doing t(x) once at the very start (e.g. in your call to optim) so it doesn't have to be redone every time logL2 is called:
logL3 <- function(theta, x, y){
N <- length(y)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
contrib <- sum((y - beta %*% x)^2)
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
benchmark(
new=optim(c(1,1,1, 1), fn = logL2, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
new.new=optim(c(1,1,1, 1), fn = logL3, x=t(x), y=y,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
replications=100
)
test replications elapsed relative user.self sys.self user.child sys.child
1 new 100 3.149 2.006 3.317 0.700 0 0
2 new.new 100 1.570 1.000 1.488 0.344 0 0
It's about twice as fast. In general, if you can do something once rather than every time logL2 is called (e.g. t(x), data[, 1] etc) it'll save you some small amount of time.
With respect to your original question, however (specifically the *apply functions):
vapply takes a list or vector as input, and a matrix is treated as a vector of its elements, so contrib operates on one element of data at a time; i.e. contrib sees x as a single number. Hence the non-conformable arguments: the matrix multiplication is multiplying beta (a 1x3) with x (a 1x1), and for matrix multiplication to work you need the number of columns of beta to equal the number of rows of x. To use vapply you'd need something like
vapply(1:nrow(data), function(i) contrib(beta, data[i, ]), FUN.VALUE=1)
(Caveat: I have not benchmarked these statements; this is just what I have found in my experience.) Of all the *apply functions, I find that apply() is slow (often slower than the for loop). It is handy for neatness of code ("do this for every row" or "do this for every column"-type tasks: instead of lots of data[i, ], it's just apply(..., MARGIN=1)), but if you need speed, use a for loop or one of the other cousins like vapply, lapply or sapply.
vapply and lapply are fast. sapply is too, but in general one of the former two is faster: sapply works out the FUN.VALUE part of vapply for you, and if the result type isn't always the same it behaves like lapply, so you may as well use those directly. Because sapply does that extra work it is easier to use, but minutely slower.
Fastest of all is when you can use some maths to avoid a loop entirely, e.g. rephrasing the loop as a matrix multiplication as I did here, though this only applies to a small number of situations (see the timing sketch below).
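To put rough numbers on those last points, one could time the row-wise sum of squared residuals on the simulated data directly. This is an untested sketch using rbenchmark, the contrib() helper from the question, and an arbitrary beta; the labels for_loop, apply_row, vapply_idx and vectorised are just names I chose:
beta <- c(1, 1, 1)  # arbitrary coefficients, only used for timing
benchmark(
  for_loop   = {s <- 0; for (i in 1:nrow(my_data)) s <- s + contrib(beta, my_data[i, ]); s},
  apply_row  = sum(apply(my_data, 1, contrib, beta = beta)),
  vapply_idx = sum(vapply(1:nrow(my_data), function(i) contrib(beta, my_data[i, ]), FUN.VALUE = 1)),
  vectorised = sum((my_data[, 1] - beta %*% t(my_data[, -1]))^2),
  replications = 10
)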
Consider the following ODE with boundary condition
y'[x] = (a - 1)*y[x]/x
y[1] = a*b
where a > 0, b > 0, and x > 0.
I want to solve this ODE numerically in R using the command ode from the R package deSolve.
That is, I want to calculate the solution y[x] for x > 0.
Questions: (i) I do not know how to include "x" in the denominator in the model equation. (ii) I also don't know how to specify the initial condition.
Disclaimer: I know this equation has an analytical solution, but I am trying to compare it with the numerical solution.
So far, I have unsuccessfully tried:
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y*(a - 1)/x # HOW TO SPECIFY x?
# result
return( list(dy) )
}
params <- c(a = 2, b = 1)
init <- c(y = params[1]*params[2]) # HOW TO SPECIFY THE INITIAL CONDITION?
times <- seq(0,5, by = 0.1)
out <- ode(init, times, difeq, params)
plot(out)
Looking at the equation, I would say that x plays the role of the time t. Package deSolve calls the independent variable "time" because that is the most common case, but it is not limited to time-dependent systems; it can also be something else, e.g. a spatial coordinate. Just set x = t and it works exactly the same.
Note also that you need to avoid 0 in the denominator.
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y * (a - 1) / x
# result
return(list(y=dy))
}
params <- c(a = 2, b = 1)
init <- c(y = unname(params[1]*params[2]))
times <- seq(1, 5, by = 0.1)
out <- ode(init, times, difeq, params)
## just rename "time" to "x"
colnames(out)[1] <- "x"
head(out)
    x   y
1 1.0 2.0
2 1.1 2.2
3 1.2 2.4
4 1.3 2.6
5 1.4 2.8
6 1.5 3.0
7 1.6 3.2
8 1.7 3.4
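For reference, this ODE has the closed-form solution y(x) = a*b*x^(a-1), which is y = 2x for a = 2, b = 1, so the numerical result can be checked directly. A small sketch using the objects defined above (not part of the original answer):
## analytical solution y(x) = a * b * x^(a - 1); here y = 2 * x
y_exact <- params[["a"]] * params[["b"]] * out[, "x"]^(params[["a"]] - 1)
max(abs(out[, "y"] - y_exact))  # should be very close to zero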
To get the initial value at x -> 0, one may use an optimizer, run the system backwards (see the separate answer below) or use another solver (see the CRAN Task View).
Another part of the question was how to specify the initial condition close to x = 0. As this is somewhat of an additional question to the technical part of how to include x in the denominator, I will give it as a separate answer.
The approach uses a backward simulation. As negative time steps are not supported in deSolve, a workaround can be used:
Instead of running the simulation from 1 to 0 backwards, we run it forward from -1 to 0.
Because of a pole at x=0, we use a small value close to zero.
In the model, we change the sign of the derivative, and as x appears also in the model, we multiply it with the sign too.
Note: this is explicitly included here to show the general approach for didactical reasons. In the special case here, signs will of course cancel out.
In the script below, we first start integration from the known initial value at x=1. Then we do the "backward" integration to estimate the initial value close to zero. Finally we run the model with the new initial value for the whole range of x and compare the results.
library(deSolve)
difeq <- function(x, y, parms, sign = 1) {
with(as.list(parms),{
dy <- sign * y * (a - 1) / (sign * x)
return(list(y=dy))
})
}
## initial simulation starting at x = 1
params <- c(a = 2, b = 1, sign = 1)
init <- c(y = params[["a"]] * params[["b"]])
times <- seq(1, 5, by = 0.1)
out1 <- ode(init, times, difeq, params)
## backwards simulation
close_to_zero <- 1e-16 # numerical precision is limited, do not decrease this more
times <- c(-1, -close_to_zero)
out0 <- ode(init, times, difeq, params, sign = -1)
## forward simulation starting close to zero
times <- c(close_to_zero, seq(0.1, 10, 0.1))
init <- c(y = out0[[nrow(out0), 2]])
cat("y[", close_to_zero,"] =", init, "\n")
out2 <- ode(init, times, difeq, params)#, hmax=0.01)
## comparison of the initial simulation with an extended time period
plot(out2, out1, lwd=c(1, 5), lty=c("solid", "dotted"), xlab="x")
I am trying to implement a vectorized exponentially weighted moving standard deviation using R. Is this the correct approach?
ewma <- function (x, alpha) {
c(stats::filter(x * alpha, 1 - alpha, "recursive", init = x[1]))
}
ewmsd <- function(x, alpha) {
sqerror <- na.omit((x - lag(ewma(x, alpha)))^2)
ewmvar <- c(stats::filter(sqerror * alpha, 1 - alpha, "recursive", init = 0))
c(NA, sqrt(ewmvar))
}
I'm guessing it's not, since its output is different from Python's pandas.Series.ewm.std() function.
When I run
ewmsd(x = 0:9, alpha = 0.96)
the output is
[1] NA 0.2236068 0.4874679 0.7953500 1.1353903 1.4993855 1.8812961 2.2764708 2.6812160 3.0925367
However, with
pd.Series(range(10)).ewm(alpha = 0.96).std()
the output is
0 NaN
1 0.707107
2 0.746729
3 0.750825
4 0.751135
5 0.751155
6 0.751156
7 0.751157
8 0.751157
9 0.751157
According to the documentation for pandas, the pandas.Series.ewm() function takes an adjust parameter, which defaults to True. When adjust=True, the exponentially weighted moving average from pandas.Series.ewm.mean() is calculated through explicit weights, not recursively. Naturally, this affects the standard deviation output as well. See this GitHub issue and this question for more info.
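In other words, with adjust=True the mean at step t is a finite weighted average of x[1..t] with weights (1-alpha)^(t-1), ..., (1-alpha), 1, rather than the recursive filter used in ewma() above. A minimal sketch of that weighting (ewm_mean_adjust is just an illustrative name, not a pandas or base-R function):
ewm_mean_adjust <- function(x, alpha) {
  sapply(seq_along(x), function(t) {
    w <- (1 - alpha)^((t - 1):0)  # oldest observation gets the smallest weight
    sum(w * x[1:t]) / sum(w)
  })
}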
Here's a vectorized solution in R:
ewmsd <- function(x, alpha) {
n <- length(x)
sapply(
1:n,
function(i, x, alpha) {
y <- x[1:i]
m <- length(y)
weights <- (1 - alpha)^((m - 1):0)
ewma <- sum(weights * y) / sum(weights)
bias <- sum(weights)^2 / (sum(weights)^2 - sum(weights^2))
ewmsd <- sqrt(bias * sum(weights * (y - ewma)^2) / sum(weights))
},
x = x,
alpha = alpha
)
}
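As a quick check (not run here), the vectorized version applied to the question's input should reproduce the pandas values shown above, up to rounding:
ewmsd(x = 0:9, alpha = 0.96)
# expected: NaN 0.707107 0.746729 0.750825 0.751135 0.751155 0.751156 0.751157 0.751157 0.751157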
I have a question concerning the computation of a double integral in R. Maybe it is not the best software package to try numerical integration, but we are heavily relying on its stochastic optimisation packages (the function to be optimised is very non-trivial, with lots of local minima), so we cannot switch to MATLAB or other packages.
The problem is the following: it takes a whale of a time to compute the double integral using nested integrate functions, and several times more (!) using the hcubature approach from the cubature package. I tried the first solution from this answer (using hcubature from the cubature package), but it made the timing even worse; besides that, infinite integration limits are not supported, and the integration chokes for (-100, 100) interval already. With the second solution (quad2d from pracma package), the timing is great, but the computation result is way off!
The single integral is computed quite quickly (e.g., if the double integrals are commented out, it takes only 0.2 seconds to compute the value of the function, which is tolerable).
Here is a heavily simplified version of the function for the MWE (just to illustrate the point of integration).
library(cubature)
library(pracma)
# Generate some artificial data to try this function on
set.seed(100)
n <- 200
r <- rnorm(n, 0.0004, 0.01)
# Log-likelihood function accepts 3 parameters:
# [1] shape of positive shocks, [2] shape of negative shocks, [3] DoF of Student's distribution for jumps
parm <- c(6, 7, 10)
LL <- function(parm, cub = "default") {
shapes <- parm[1:2]
studdof <- parm[3]
# For simplification, generate some dynamic series
set.seed(101)
sigmaeps <- rgamma(n, shape=shapes[1], rate=1000)
sigmaeta <- rgamma(n, shape=shapes[2], rate=1000)
lambdas <- rgamma(n, shape=10, rate=80)+1
probs <- sapply(lambdas, function(x) dpois(0:2, lambda=x))
probs <- sweep(probs, 2, colSums(probs), FUN="/") # Normalising the probabilities
# Reserving memory for 3 series of density
fw0 <- rep(NA, n)
fw1 <- rep(NA, n)
fw2 <- rep(NA, n)
for (t in 2:n) {
integ0 <- function(e) { # First integrand for 0 jumps
1/sigmaeta[t] * dgamma(-(r[t]-sigmaeps[t]*e)/sigmaeta[t], shape=shapes[2]) * # Density of negative shocks
dgamma(e, shape=shapes[1]) # Density of positive shocks
}
integ1 <- function(e, g) { # Double integrand for 1 jump
1/sigmaeta[t] * dgamma(-(r[t]-sigmaeps[t]*e-1*g)/sigmaeta[t], shape=shapes[2]) * # Density of negative shocks
dgamma(e, shape=shapes[1]) * # Density of positive shocks
dt(g, df = studdof)/1 # Density of jump intensity
}
integ2 <- function(e, g) { # Double integrand for 2 jumps
1/sigmaeta[t] * dgamma(-(r[t]-sigmaeps[t]*e-2*g)/sigmaeta[t], shape=shapes[2]) * # Density of negative shocks
dgamma(e, shape=shapes[1]) * # Density of positive shocks
dt(g, df = studdof)/2 # Density of jump intensity
}
# Wrappers for cubature because they need vector inputs
wrapper1 <- function(x) integ1(x[1], x[2])
wrapper2 <- function(x) integ2(x[1], x[2])
# Single integral that is not a problem
fw0[t] <- integrate(integ0, 0, Inf)$value
if (cub=="cubature") {
# 2D CUBATURE FROM cubature PACKAGE
fw1[t] <- hcubature(wrapper1, c(0, -20), c(20, 20))$integral
fw2[t] <- hcubature(wrapper2, c(0, -20), c(20, 20))$integral
} else if (cub=="prac2d") {
# 2D CUBATURE FROM pracma PACKAGE
fw1[t] <- quad2d(integ1, 0, 100, -100, 100)
fw2[t] <- quad2d(integ2, 0, 100, -100, 100)
} else if (cub=="default") {
# DOUBLE INTEGRALS FROM BUILT-IN INTEGRATE
fw1[t] <- integrate(function(g) { sapply(g, function(g) { integrate(function(e) integ1(e, g), 0, Inf)$value }) }, -Inf, Inf)$value
fw2[t] <- integrate(function(g) { sapply(g, function(g) { integrate(function(e) integ2(e, g), 0, Inf)$value }) }, -Inf, Inf)$value
}
if (!t%%10) print(t)
}
fw <- fw0*probs[1, ] + fw1*probs[2, ] + fw2*probs[3, ]
fw <- log(fw[2:n])
fw[is.nan(fw)] <- -Inf
slfw <- sum(fw)
print(paste0("Point: ", paste(formatC(parm, 4, format="e", digits=3), collapse=" "), ", LL: ", round(slfw, 2)))
return(slfw)
}
system.time(LL(parm, cub="default"))
# 13 seconds
# "Point: 6.000e+00 7.000e+00 1.000e+01, LL: 247.78"
system.time(LL(parm, cub="cubature"))
# 29 seconds, the result is slightly off
# "Point: 6.000e+00 7.000e+00 1.000e+01, LL: 241.7"
system.time(LL(parm, cub="prac2d"))
# 0.5 seconds, the result is way off
# "Point: 6.000e+00 7.000e+00 1.000e+01, LL: 223.25"
(Ideally, integ1(e, g) and integ2(e, g) should be integrated over [0, Inf) w.r.t. e and over (-Inf, Inf) w.r.t. g.)
Parallelisation is done at a higher level (i.e., the stochastic optimiser is computing the values of this likelihood function in parallel), so it is essential that this function run as quickly as possible on a single core.
Is there any way to speed up the computation of this double integral?
Here is a wrapper for hcubature which I use to allow infinite limits:
hcubature.inf <- function() {
cl <- match.call()
cl[[1L]] <- quote(cubature::hcubature)
if(all(is.finite(c(lowerLimit,upperLimit)))) return(eval.parent(cl))
# convert limits to new coordinates to incorporate infinities
cl[['upperLimit']] <- atan(upperLimit)
cl[['lowerLimit']] <- atan(lowerLimit)
# wrap the function with the coordinate transformation
# update argument to hcubature with our function
f <- match.fun(f)
cl[['f']] <- if(!vectorInterface)
function(x, ...) f(tan(x), ...) / prod(cos(x))^2
else
function(x, ...) f(tan(x), ...) / rep(apply(cos(x), 2, prod)^2, each=fDim)
eval.parent(cl)
}
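# give the wrapper the same formal arguments as cubature::hcubature, so that f, lowerLimit,
# upperLimit, vectorInterface, fDim, etc. are defined inside its body and captured by match.call()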
formals(hcubature.inf) <- formals(cubature::hcubature)
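For reference, the coordinate change in the wrapper is the standard tangent substitution, which maps an infinite integration range onto a finite one; this is why the limits are passed through atan and the integrand is divided by prod(cos(x))^2:
\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\pi/2}^{\pi/2} \frac{f(\tan u)}{\cos^2 u}\,du, \qquad x = \tan u,\quad dx = \frac{du}{\cos^2 u}.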
Then you should vectorize the integrands:
vwrapper1 <- function(x) as.matrix(integ1(x[1,], x[2,]))
vwrapper2 <- function(x) as.matrix(integ2(x[1,], x[2,]))
And integrate:
if (cub=="cubature.inf") {
fw1[t] <- hcubature.inf(vwrapper1, c(0, -Inf), c(Inf, Inf), vectorInterface=TRUE)$integral
fw2[t] <- hcubature.inf(vwrapper2, c(0, -Inf), c(Inf, Inf), vectorInterface=TRUE)$integral
} else if (cub=="cubature") {
...
You get a value of 242.83 in about half the time of your default method.
Suppose I have the following data frame
set.seed(36)
n <- 300
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
For each value d$y[i] with d$y[i] <= 300, I have to compute res[i], where the numerator is the sum of the indicator (d$x <= d$y[i]) and the denominator is the sum of the indicator (d$y >= d$y[i]). I have written the code in a for loop:
res <- NULL
for( i in seq_len(sum(d$y<=300)) ){
numerator <- sum(d$x <= d$y[i])
denominator <- sum(d$y >= d$y[i])
res[i] <- numerator / denominator
}
But my concern is that when the number of observations of x and y is large, that is, when the number of rows of the data frame increases, the for loop becomes slow. Additionally, if I simulate the data 1000 times and run the for loop each time, the program becomes very inefficient.
What would be a more efficient solution?
This depends on d already being sorted as it is:
# example data
set.seed(36)
n <- 1e5
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
My suggestion (thanks to #alexis_laz for the denominator):
system.time(res3 <- {
xs <- sort(d$x) # sorted x
yt <- d$y[d$y <= 300] # truncated y
num = findInterval(yt, xs)
den = length(d$y) - match(yt, d$y) + 1L
num/den
})
# user system elapsed
# 0 0 0
OP's approach:
system.time(res <- {
res <- NULL
for( i in seq_len(sum(d$y<=300)) ){
numerator <- sum(d$x <= d$y[i])
denominator <- sum(d$y >= d$y[i])
res[i] <- numerator / denominator
}
res
})
# user system elapsed
# 50.77 1.13 52.10
# verify it matched
all.equal(res,res3) # TRUE
#d.b's approach:
system.time(res2 <- {
numerator = rowSums(outer(d$y, d$x, ">="))
denominator = rowSums(outer(d$y, d$y, "<="))
res2 = numerator/denominator
res2 = res2[d$y <= 300]
res2
})
# Error: cannot allocate vector of size 74.5 Gb
# ^ This error is common when using outer() on large-ish problems
Vectorization. Generally, tasks are faster in R if they can be vectorized. The key functions related to ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all work on vectors.
Continuous vs discrete. The match above should be a fast way to compute the denominator whether the data is continuous or has mass points / repeating values. If the data is continuous (and so has no repeats), the denominator can just be seq(length(xs), length = length(yt), by=-1). If it is fully discrete and has a lot of repetition (like the example here), there might be some way to make that faster as well, maybe like one of these:
den2 <- inverse.rle(with(rle(yt), list(
values = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
lengths = lengths)))
tab <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
# verify
all.equal(den,den2) # TRUE
all.equal(den,den3) # TRUE
findInterval will still work for the numerator for continuous data. It's not ideal for the repeated-values case considered here I guess (since we're redundantly finding the interval for many repeated yt values). Similar ideas for speeding that up likely apply.
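For instance, one possibility (untested here), mirroring the den3 idea above, is to call findInterval only once per distinct value of yt and then expand the result with rep():
num3 <- rep(findInterval(unique(yt), xs), tab)  # tab as defined in the den3 snippet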
Other options. As #chinsoon suggested, the data.table package might be a good fit if findInterval is too slow, since it has a lot of features focused on sorted data, but it's not obvious to me how to apply it here.
Instead of running a loop, generate all the numerators and denominators at once. This also allows you to keep track of which res is associated with which x and y. Later, you can keep only the ones you want.
You can use outer for element wise comparison between vectors.
numerator = rowSums(outer(d$y, d$x, ">=")) #Compare all y against all x
denominator = rowSums(outer(d$y, d$y, "<=")) #Compare all y against itself
res2 = numerator/denominator #Obtain 'res' for all rows
#I would first 'cbind' res2 to d and only then remove the ones for 'y <=300'
res2 = res2[d$y <= 300] #Keep only those 'res' that you want
Since this uses rowSums, it should be faster.
I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
quantile.R source code
or type: quantile.default
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m, that's straightforward. m depends on which of the 9 algorithms was chosen, so just as Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0; for 3, m is -1/2; and for the others, that's in the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and the m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). Each algorithm only differs in the algorithm for the pk.
As for the last bit, R is just stating what S uses.
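To make the interpolation concrete for the default type 7 (alpha = beta = 1, so m = 1 - p and j = floor(p*(n-1) + 1)), here is a small hand computation; it is not from the original answer, but it is easy to verify against quantile():
x <- 1:5
p <- 0.3
n <- length(x)
idx   <- p * (n - 1) + 1                 # n*p + m with m = 1 - p
j     <- floor(idx)
gamma <- idx - j                         # fractional part g; gamma = g for type 7
(1 - gamma) * x[j] + gamma * x[j + 1]    # 2.2
quantile(x, probs = p, type = 7, names = FALSE)  # 2.2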
The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles, and is what's described in Wikipedia.
BTW, in response to dreeves answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's Mathworld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", etc. Each of these methods has different properties that are better suited for one problem or another.
The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
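The comparison suggested above could be run like this (a quick sketch; output not shown):
# quantiles of 1:5 and 1:6 at p = 0.25 under all nine types
sapply(1:9, function(tp) quantile(1:5, probs = 0.25, type = tp, names = FALSE))
sapply(1:9, function(tp) quantile(1:6, probs = 0.25, type = tp, names = FALSE))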
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R, so I actually just grabbed it from a GitHub repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code and also I added a few lines to print some important values as it goes.
quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
                             type = 7, ...){
if(is.factor(x)) { #worry about non-numeric data
if(!is.ordered(x) || ! type %in% c(1L, 3L))
stop("factors are not allowed")
lx <- levels(x)
} else lx <- NULL
if (na.rm){
x <- x[!is.na(x)]
} else if (anyNA(x)){
stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
}
eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
stop("'probs' outside [0,1]")
#####################################
# here is where terms really used in default type==7 situation get defined
n <- length(x) #how many observations are in sample?
if(na.p <- any(!p.ok)) { # set aside NA & NaN
o.pr <- probs
probs <- probs[p.ok]
probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
}
np <- length(probs) #how many quantiles are you computing?
if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
if(type == 7) { # be completely back-compatible
index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
lo <- floor(index) #this is the observed order statistic just below each quantile
hi <- ceiling(index) #above
x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort,
#and it only guarantees that sorting is "right" at these order statistics, important for large vectors
#ties are not broken and tied elements just stay in their original order
qs <- x[lo] #the values associated with the "floor" order statistics
i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value
#this is the difference between the order statistic and the available ranks, i think
h <- (index - lo)[i] # > 0 by construction
## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
# h is the arithmetic difference between the desired order statistic and the available ranks
#interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd.
# This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
###################################
# print all of these things
cat("floor pos=", c(lo))
cat("\nceiling pos=", c(hi))
cat("\nfloor values= ", c(x[lo]))
cat( "\nwhich floors not targets? ", c(i))
cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
cat( "\nadjustment values= ", c(h))
cat("\nquantile estimates:")
}else if (type <= 3){## Types 1, 2 and 3 are discontinuous sample qs.
nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
} else {n * probs} # m = 0
j <- floor(nppm)
h <- switch(type,
(nppm > j), # type 1
((nppm > j) + 1)/2, # type 2
(nppm != j) | ((j %% 2L) == 1L)) # type 3
} else{
## Types 4 through 9 are continuous sample qs.
switch(type - 3,
{a <- 0; b <- 1}, # type 4
a <- b <- 0.5, # type 5
a <- b <- 0, # type 6
a <- b <- 1, # type 7 (unused here)
a <- b <- 1 / 3, # type 8
a <- b <- 3 / 8) # type 9
## need to watch for rounding errors here
fuzz <- 4 * .Machine$double.eps
nppm <- a + probs * (n + 1 - a - b) # n*probs + m
j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
h <- nppm - j
if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
x <- sort(x, partial =
unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
)
x <- c(x[1L], x[1L], x, x[n], x[n])
## h can be zero or one (types 1 to 3), and infinities matter
#### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
## also h*x might be invalid ... e.g. Dates and ordered factors
qs <- x[j+2L]
qs[h == 1] <- x[j+3L][h == 1]
other <- (0 < h) & (h < 1)
if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
}
} else {
qs <- rep(NA_real_, np)}
if(is.character(lx)){
qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
if(names && np > 0L) {
names(qs) <- format_perc(probs)
}
if(na.p) { # do this more elegantly (?!)
o.pr[p.ok] <- qs
names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
names(o.pr)[p.ok] <- names(qs)
o.pr
} else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#.75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))