I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the docs:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1),
where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
See the quantile.R source code, or type quantile.default at the R prompt.
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". The American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
As for m that's straightforward. m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0, for 3, m is -1/2, and for the others, that's in the next part.
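To make j, g, and gamma concrete before moving on, here is a small worked check of my own (not part of the original answer), using type 1, where m = 0 and gamma is 1 if g > 0 and 0 otherwise:
# Type 1 (inverse empirical CDF): j = floor(n*p + m), g = n*p + m - j
x <- 1:10                     # already sorted, so x[j] is the jth order statistic
n <- length(x)
p <- 0.25
m <- 0                        # m for type 1
j <- floor(n * p + m)         # 2
g <- n * p + m - j            # 0.5
gamma <- as.numeric(g > 0)    # 1
(1 - gamma) * x[j] + gamma * x[j + 1]   # 3
quantile(x, probs = p, type = 1)        # also 3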
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, so m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). The algorithms differ only in how the pk are chosen.
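As an illustration of the plotting-position view (again a snippet of my own, not from the original answer): for type 7 the plotting positions are pk = (k - 1)/(n - 1), and linearly interpolating the points (pk, x[k]) reproduces quantile(..., type = 7):
x <- sort(c(7, 1, 10, 4, 2, 8, 5, 3, 9, 6))   # order statistics of an arbitrary sample
n <- length(x)
pk <- (seq_len(n) - 1) / (n - 1)              # plotting positions for type 7
p <- 0.25
approx(pk, x, xout = p)$y                     # 3.25
quantile(x, probs = p, type = 7)              # also 3.25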
As for the last bit, R is just stating what S uses.
The original paper gives a list of six desirable properties for a sample quantile function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles and is what's described in Wikipedia.
BTW, in response to dreeves answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's Mathworld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.
Consider the question of what to do when your observations don't fall on quantiles exactly.
The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".
So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)" etc. Each of these methods have different properties that are better suited for one problem or another.
The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.
See what happens with the different types when you put in vectors like 1:5 and 1:6.
(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).
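To see those differences directly, here is a quick snippet of my own following the suggestion above to try 1:5 and 1:6:
# nine estimates of the same 0.25 quantile, one per type
sapply(1:9, function(t) quantile(1:5, probs = 0.25, type = t, names = FALSE))
sapply(1:9, function(t) quantile(1:6, probs = 0.25, type = t, names = FALSE))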
I believe the R help documentation is clear after the revisions noted in #RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.
To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R so I actually just grabbed it from a github repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.
You can see how the "type 7" method interpolates, step by step, both in the code itself and in a few lines I added to print some important values as it goes.
quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
                             names = TRUE, type = 7, ...) {
if(is.factor(x)) { #worry about non-numeric data
if(!is.ordered(x) || ! type %in% c(1L, 3L))
stop("factors are not allowed")
lx <- levels(x)
} else lx <- NULL
if (na.rm){
x <- x[!is.na(x)]
} else if (anyNA(x)){
stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
}
eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
stop("'probs' outside [0,1]")
#####################################
# here is where terms really used in default type==7 situation get defined
n <- length(x) #how many observations are in sample?
if(na.p <- any(!p.ok)) { # set aside NA & NaN
o.pr <- probs
probs <- probs[p.ok]
probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
}
np <- length(probs) #how many quantiles are you computing?
if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
if(type == 7) { # be completely back-compatible
index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
lo <- floor(index) #this is the observed order statistic just below each quantile
hi <- ceiling(index) #above
x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort,
#and it only guarantees that sorting is "right" at these order statistics, important for large vectors
#ties are not broken and tied elements just stay in their original order
qs <- x[lo] #the values associated with the "floor" order statistics
i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value
#this is the difference between the order statistic and the available ranks, i think
h <- (index - lo)[i] # > 0 by construction
## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high.
# h is the arithmetic difference between the desired order statistic and the available ranks
#interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd.
# This means having a more extreme 99th observation doesn't matter when computing the .75 quantile
###################################
# print all of these things
cat("floor pos=", c(lo))
cat("\nceiling pos=", c(hi))
cat("\nfloor values= ", c(x[lo]))
cat( "\nwhich floors not targets? ", c(i))
cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
cat( "\nadjustment values= ", c(h))
cat("\nquantile estimates:")
}else if (type <= 3){## Types 1, 2 and 3 are discontinuous sample qs.
nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
} else {n * probs} # m = 0
j <- floor(nppm)
h <- switch(type,
(nppm > j), # type 1
((nppm > j) + 1)/2, # type 2
(nppm != j) | ((j %% 2L) == 1L)) # type 3
} else{
## Types 4 through 9 are continuous sample qs.
switch(type - 3,
{a <- 0; b <- 1}, # type 4
a <- b <- 0.5, # type 5
a <- b <- 0, # type 6
a <- b <- 1, # type 7 (unused here)
a <- b <- 1 / 3, # type 8
a <- b <- 3 / 8) # type 9
## need to watch for rounding errors here
fuzz <- 4 * .Machine$double.eps
nppm <- a + probs * (n + 1 - a - b) # n*probs + m
j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
h <- nppm - j
if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
x <- sort(x, partial =
unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
)
x <- c(x[1L], x[1L], x, x[n], x[n])
## h can be zero or one (types 1 to 3), and infinities matter
#### qs <- (1 - h) * x[j + 2] + h * x[j + 3]
## also h*x might be invalid ... e.g. Dates and ordered factors
qs <- x[j+2L]
qs[h == 1] <- x[j+3L][h == 1]
other <- (0 < h) & (h < 1)
if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
}
} else {
qs <- rep(NA_real_, np)}
if(is.character(lx)){
qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
if(names && np > 0L) {
names(qs) <- format_perc(probs) # format_perc() is internal to the stats package (stats:::format_perc)
}
if(na.p) { # do this more elegantly (?!)
o.pr[p.ok] <- qs
names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
names(o.pr)[p.ok] <- names(qs)
o.pr
} else qs
}
####################
# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)
#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)
# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#.75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.
#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))
Consider the following ODE with boundary condition
y'[x] = (a - 1)*y[x]/x
y[1] = a*b
where a > 0, b > 0, and x > 0.
I want to solve this ODE numerically in R using the command ode from the R package deSolve.
That is, I want to calculate the solution y[x] for x > 0.
Questions: (i) I do not know how to include "x" in the denominator in the model equation. (ii) I also don't know how to specify the initial condition.
Disclaimer: I know this equation has an analytical solution, but I am trying to compare it with the numerical solution.
So far, I have unsuccessfully tried:
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y*(a - 1)/x # HOW TO SPECIFY x?
# result
return( list(dy) )
}
params <- c(a = 2, b = 1)
init <- c(y = params[1]*params[2]) # HOW TO SPECIFY THE INITIAL CONDITION?
times <- seq(0,5, by = 0.1)
out <- ode(init, times, difeq, params)
plot(out)
Looking at the equation, I would say that x is equivalent to time t. Package deSolve names the independent variable time because that is the most common case, but it is not limited to time-dependent systems. It can also be something else, e.g. a spatial component; just set x = t and it works exactly the same.
Note also that you need to avoid 0 in the denominator.
library(deSolve)
difeq <- function(x, y, parms) {
a <- parms[1]
b <- parms[2]
# model equations
dy <- y * (a - 1) / x
# result
return(list(y=dy))
}
params <- c(a = 2, b = 1)
init <- c(y = unname(params[1]*params[2]))
times <- seq(1, 5, by = 0.1)
out <- ode(init, times, difeq, params)
## just rename "time" to "x"
colnames(out)[1] <- "x"
head(out)
      x   y
1   1.0 2.0
2   1.1 2.2
3   1.2 2.4
4   1.3 2.6
5   1.4 2.8
6   1.5 3.0
To get the initial value at x -> 0 one may use an optimizer, run the system backwards (see the separate answer below) or use another solver (see the CRAN Task View on Differential Equations).
Another part of the question was how to specify the initial condition at x > 0. As this goes beyond the technical issue of how to include x in the denominator, I am giving it as a separate answer.
The approach uses a backward simulation. As negative time steps are not supported in deSolve, a workaround can be used:
Instead of running the simulation from 1 to 0 backwards, we run it forward from -1 to 0.
Because of a pole at x=0, we use a small value close to zero.
In the model, we change the sign of the derivative, and as x appears also in the model, we multiply it with the sign too.
Note: this is explicitly included here to show the general approach for didactical reasons. In the special case here, signs will of course cancel out.
In the script below, we first start integration from the known initial value at x=1. Then we do the "backward" integration to estimate the initial value close to zero. Finally we run the model with the new initial value for the whole range of x and compare the results.
library(deSolve)
difeq <- function(x, y, parms, sign = 1) {
with(as.list(parms),{
dy <- sign * y * (a - 1) / (sign * x)
return(list(y=dy))
})
}
## initial simulation starting at x = 1
params <- c(a = 2, b = 1, sign = 1)
init <- c(y = params[["a"]] * params[["b"]])
times <- seq(1, 5, by = 0.1)
out1 <- ode(init, times, difeq, params)
## backwards simulation
close_to_zero <- 1e-16 # numerical precision is limited, do not decrease this more
times <- c(-1, -close_to_zero)
out0 <- ode(init, times, difeq, params, sign = -1)
## forward simulation starting close to zero
times <- c(close_to_zero, seq(0.1, 10, 0.1))
init <- c(y = out0[[nrow(out0), 2]])
cat("y[", close_to_zero,"] =", init, "\n")
out2 <- ode(init, times, difeq, params)#, hmax=0.01)
## comparison of the initial simulation with an extended time period
plot(out2, out1, lwd=c(1, 5), lty=c("solid", "dotted"), xlab="x")
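Not part of the original answer, but since the question mentions comparing with the analytical solution: the ODE is separable, giving y = a*b*x^(a-1), which for a = 2 and b = 1 is simply y = 2*x. A quick check against the numerical result:
a <- params[["a"]]; b <- params[["b"]]
x_num <- out2[, 1]                      # the independent variable (the "time" column)
y_ana <- a * b * x_num^(a - 1)          # analytical solution
max(abs(out2[, 2] - y_ana))             # should be small (within solver tolerances)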
I have to use recursion to produce pseudo random numbers. For fixed values a, b and c, I need to calculate:
x_n+1 = (a * x_n + c) modulo 2^b. Random numbers are obtained by the function R_n = x_n / (2^b). I need to save these R_n values to make a histogram. How can I make a function in R that uses its previous value x_n to produce x_n+1? I have made a start with my code; it's listed below.
a=5
b=4
c=3
k=10000
random <- function(x) {
  if (x < k) {
    x = (a*x+c) %% 2^b
    k++
  }
}
Here's a thought for starters,
random <- function(a = 5, b = 4, c = 3, k = 10000, x0 = 1) {
  x <- x0 # or some other sane default
  function(n = 1) {
    newx <- Reduce(function(oldx, ign) (a*oldx + c) %% (2^b), seq_len(n),
                   init = x, accumulate = TRUE)[-1]
    # if (x >= k)? do something else
    if (length(newx)) {
      x <<- newx[length(newx)]
      k <<- k + n
    }
    newx
  }
}
The premise is that the random function is a setup function that returns a function. This inner function has its a, b, c, k, and previous x variables stored within it.
thisran <- random()
thisran()
# [1] 8
thisran(3)
# [1] 11 10 5
I haven't studied creating PRNGs in depth, but I'm inferring that x0 here is effectively your seed. I'm not certain why you had an if (x < k) conditional in your function; since k was never used otherwise, just incremented, I'm thinking it only serves as a termination indicator for your PRNG (so it is not infinite).
If need be, the current k value (and other variables, for that matter) can be peeked-at with
get("k", environment(thisran))
# [1] 10003
BTW: the use of Reduce might seem like an unnecessary complication, but it enables the thisran(n) functionality, similar to other PRNGs in R. That is, one can do runif(7) for seven random numbers, and I thought it would be useful to do that here. The use of Reduce is required in that case since each calculation depends on the result of the previous calculation, so a simple replicate or sapply would not work (without some contrived coding that I wanted to avoid).
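To cover the histogram part of the question, here is a usage sketch of my own built on the closure above: draw a batch of x_n values, convert them to R_n = x_n / 2^b, and plot.
gen <- random(a = 5, b = 4, c = 3, x0 = 1)
xs <- gen(10000)          # 10,000 successive x_n values
rn <- xs / 2^4            # R_n = x_n / 2^b
hist(rn, breaks = 16)     # with b = 4 there are only 16 possible values, so the
                          # histogram just shows how evenly the generator cycles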
I'm trying to write a function in R to compute the greatest common RATIONAL divisor of a vector. So I'm not working with a vector of integers, but of numerics. From this vector I would like to automatically determine the largest numeric that divides all the values in the vector and yields an integer, which is very difficult with the floating-point arithmetic used in R.
To give an example, let's say that I would like to find the greatest common rational divisor of 5, 0.37 and 0.02. It's 0.01, but how can I automate this, taking into account that with floating-point arithmetic 0.37 will, for instance, be stored as something like 0.3700000000000000000000000005271 in R? Because of that I can't easily determine the lowest decimal order (10^-2 in my example), or if you prefer, the lowest one that is significant for me as a user.
The fact that the result itself will have floating-point-like error (e.g. 0.0100000000000000000000008465 in place of 0.01) is not a problem. However, the solution should ideally be as general as possible (capable of working with vectors containing extremely different values, 10^20 and 10^-20 for instance).
I found a solution. The basic idea is to divide everything by the smallest value and then multiply by successive integers until every value becomes an integer within tolerance. This is controlled via the floor() function, which gives explicit control over the level of tolerance. I added some limits on the number of possibilities it tests, to keep it reasonably efficient, but I'm not sure this is the best method. Anyway, I'll put this in the StratigrapheR package.
divisor <- function(x, tolerance = 8, relative = T, tries = 4, speak = T)
{
if(!(isFALSE(relative) | isTRUE(relative))) {
stop("The 'relative' parameter should be TRUE or FALSE'")
}
if(!(isFALSE(speak) | isTRUE(speak))) {
stop("The 'speak' parameter should be TRUE or FALSE'")
}
x <- unique(x)
x <- x[x != 0]
# Divide by smallest
mx <- min(x)
d <- x/mx
if(!relative){
# Test if tolerance is of lower order than the smallest
if(-log10(mx) > tolerance) {
stop(paste("If 'relative' is FALSE, the smallest value (zero excepted)",
"should of higher order than the order",
"defined by the 'tolerance' parameter"))
}
}
# Test the dispersion of values
general_tolerance <- 15 # Order of digits affected by floating-point
if(log10(max(x)) >= (general_tolerance - tolerance)){
stop(paste("The range of 'x' values is too large to find a meaningful",
"greatest common rational divisor.",
"To solve this problem you can change the values in x or",
" lower the 'tolerance' parameter (i.e. the",
"tolerance for floating-point aritmetics):in the later case be",
"critical of the result."))
}
# Test and remove values that are multiples of the smallest value
remain1 <- (d - floor(d + 10^-(tolerance - 1)))
if(!relative) remain1 <- remain1 * mx
rzero1 <- abs(remain1) < 10^-tolerance
d <- d[!rzero1]
if(length(d) == 0) {
if(!relative){
res <- round(mx, tolerance)
} else {
res <- signif(mx, tolerance)
}
} else {
# Multiply d [x/min(x)] by integers, and test if
# this returns only integers within tolerance
ld <- length(d)
try_order_OLD <- 0
try_order_i <- 6 - ceiling(log10(ld))
for(i in seq_len(as.integer(tries))){
if(speak) {
print(paste("Try ",i,": 10^",try_order_i,
" possibilities tested", sep = ""))
}
t <- 1:(10^try_order_i)
t <- t[-(1:(10^try_order_OLD))]
lt <- length(t)
tmat <- matrix(rep(t,ld), ncol = ld)
dmat <- matrix(rep(d, lt), ncol = ld, byrow = T)
test <- dmat * tmat
remain2 <- (test - floor(test + 10^-(tolerance - 1)))
remain2 <- remain2/tmat
if(!relative) remain2 <- remain2 * mx
rzero2 <- abs(remain2) < 10^-tolerance
rzero2 <- matrix(as.integer(rzero2), ncol = ld)
test[which(rowSums(rzero2) == ld),]
res <- mx/t[which(rowSums(rzero2) == ld)[1]]
if(!is.na(res)) break
try_order_OLD <- try_order_i
try_order_i <- try_order_i + 1
}
}
return(res)
}
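A quick check of my own on the example from the question; per that example the result should come out as (approximately) 0.01:
divisor(c(5, 0.37, 0.02))
# should return 0.01 (up to floating-point error)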
Using base R, I was wondering if I could determine the 95% area under the curve denoted posterior below.
More specifically, I want to move from the mode (the green dashed line in my plot) toward the tails and then stop when I have covered 95% of the curve area. What I'm after are the x-axis values that mark the limits of this 95% area.
prior = function(x) dbeta(x, 15.566, 7.051)
likelihood = function(x) dbinom(55, 100, x)
posterior = function(x) prior(x)*likelihood(x)
mode = optimize(posterior, interval = c(0, 1), maximum = TRUE, tol = 1e-12)[[1]]
curve(posterior, n = 1e4)
P.S. In other words, it is highly desirable that such an interval be the shortest 95% interval possible.
Symmetric distribution
Even though the OP's example is not exactly symmetric, it is close enough, and it is useful to start there since the solution is much simpler.
You can use a combination of integrate and optimize. I wrote this as a custom function, but note that if you use this in other situations you may have to rethink the bounds for searching the quantile.
# For a distribution with a single peak, find the symmetric
# interval that contains probs probability. Search over 'range'.
f_quan <- function(fun, probs, range=c(0,1)){
mode <- optimize(fun, interval = range, maximum = TRUE, tol = 1e-12)[[1]]
total_area <- integrate(fun, range[1], range[2])[[1]]
O <- function(d){
parea <- integrate(fun, mode-d, mode+d)[[1]] / total_area
(probs - parea)^2
}
# Bounds for searching may need some adjustment depending on the problem!
o <- optimize(O, c(0,range[2]/2 - 1E-02))[[1]]
return(c(mode-o, mode+o))
}
Use it like this,
f <- f_quan(posterior, 0.95)
curve(posterior, n = 1e4)
abline(v=f, col="blue", lwd=2, lty=3)
This gives a plot of the posterior with the two interval limits marked as blue dotted vertical lines.
Asymmetric distribution
In the case of an asymmetric distribution, we have to search for two points that meet the criterion P(a < x < b) = Prob, where Prob is some desired probability. Since there are infinitely many intervals (a, b) that meet this, the OP suggested finding the shortest one.
Important in the solution is the definition of a domain, the region where we want to search (we cannot use -Inf, Inf, so the user has to set this to reasonable values).
# consider interval (a,b) on the x-axis
# integrate our function, normalize to total area, to
# get the total probability in the interval
prob_ab <- function(fun, a, b, domain){
totarea <- integrate(fun, domain[1], domain[2])[[1]]
integrate(fun, a, b)[[1]] / totarea
}
# now given a and the probability, invert to find b
invert_prob_ab <- function(fun, a, prob, domain){
O <- function(b, fun, a, prob){
(prob_ab(fun, a, b, domain=domain) - prob)^2
}
b <- optimize(O, c(a, domain[2]), a = a, fun=fun, prob=prob)$minimum
return(b)
}
# now find the shortest interval by varying a
# Simplification: don't search past the mode, otherwise getting close
# to the right-hand side of domain will give serious trouble!
prob_int_shortest <- function(fun, prob, domain){
mode <- optimize(fun, interval = domain, maximum = TRUE, tol = 1e-12)[[1]]
# objective function to be minimized: the width of the interval
O <- function(a, fun, prob, domain){
b <- invert_prob_ab(fun, a, prob, domain)
b - a
}
# shortest interval that meets the criterion
abest <- optimize(O, c(0,mode), fun=fun, prob=prob, domain=domain)$minimum
# now return the interval
b <- invert_prob_ab(fun, abest, prob, domain)
return(c(abest,b))
}
Now use the above code like this. I use a very asymmetric function (just assume mydist is actually some complicated pdf, not the dgamma).
mydist <- function(x)dgamma(x, shape=2)
curve(mydist(x), from=0, to=10)
abline(v=prob_int_shortest(mydist, 0.9, c(0,10)), lty=3, col="blue", lwd=2)
In this example I set domain to (0,10), since clearly the interval must be in there somewhere. Note that using a very large value like (0, 1E05) does not work, because integrate has trouble with long sequences of near-zeroes. Again, for your situation, you will have to adjust the domain (unless someone has a better idea!).
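For completeness, here is how the asymmetric version can be applied to the posterior from the question (a usage sketch of my own; I take the domain to be (0, 1) since the posterior is a function of a probability):
curve(posterior, n = 1e4)
abline(v = prob_int_shortest(posterior, 0.95, c(0, 1)),
       lty = 3, col = "blue", lwd = 2)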
Here is a solution making use of the Trapezoidal rule. You will note that the solution provided by #Remko is far superior, however this solution hopefully adds some pedagogical value as it illuminates how complicated problems can be reduced to simple geometry, arithmetic, and basic programming constructs such as for loops.
findXVals <- function(lim, p) {
## (1/p) is the precision
## area of a trapezoid
trapez <- function(h1, h2, w) {(h1 + h2) * w / 2}
yVals <- posterior((1:(p - 1))/p)
m <- which.max(yVals)
nZ <- which(yVals > 1/p)
b <- m + 1
e <- m - 1
a <- f <- m
area <- 0
myRng <- 1:(length(nZ)-1)
totArea <- sum(trapez(yVals[nZ[myRng]], yVals[nZ[myRng+1]], 1/p))
targetArea <- totArea * lim
while (area < targetArea) {
area <- area + trapez(yVals[a], yVals[b], 1/p) + trapez(yVals[e], yVals[f], 1/p)
a <- b
b <- b + 1
f <- e
e <- e - 1
}
c((a - 1)/p, (f + 1)/p)
}
findXVals(.95, 10^5)
[1] 0.66375 0.48975
I'm trying to use R to estimate E[u(X)] where u is a utility function and X is a random variable. More specifically, I want to be able to rank E[u(X)] and E[u(Y)] for two random variables X and Y -- only the ranking matters.
My problem is that u(x) = -exp(-sigma * x) for some sigma > 0, and this converges very rapidly to zero. So I have many cases where I expect, say, E[u(X)] > E[u(Y)], but because they are so close to zero, my simulation cannot distinguish them.
Does anyone have any advice for me?
I am only interested in ranking the two expected utilities, so u(x) can be replaced by any u.tilde(x) = a * u(x) + b, where a > 0 and b can be any number.
Below is an example where X and Y are both normal (in which case I think there is a closed form solution, but pretend X and Y have complicated distributions that I can only simulate from).
get.u <- function(sigma=1) {
stopifnot(sigma > 0)
utility <- function(x) {
return(-exp(-sigma * x))
}
return(utility)
}
u <- get.u(sigma=1)
curve(u, from=0, to=10) # Converges very rapidly to zero
n <- 10^4
x <- rnorm(n, 10^4, sd=10)
y <- rnorm(n, 10^4, sd=10^3)
mean(u(x)) == mean(u(y)) # Returns True (they're both 0), but I expect E[u(x)] > E[u(y)]
## An example of replacing u with a*u + b
get.scaled.u <- function(sigma=1) {
stopifnot(sigma > 0) # Risk averse
utility <- function(x) {
return(-exp(-sigma * x + sigma * 10^4))
}
return(utility)
}
u <- get.scaled.u(sigma=1)
mean(u(x)) > mean(u(y)) # True as desired
x <- rnorm(n, 10^4, sd=10^3)
y <- rnorm(n, 10^4, sd=2*10^3)
mean(u(x)) > mean(u(y)) # False again -- they're both -Inf
Is finding a clever way to scale u the correct way to deal with this problem? For example, suppose X and Y both have bounded support -- if I know the bounds, how can I scale u to guarantee that a*u + b will be neither too close to -Inf, nor too close to zero?
Edit: I didn't know about multiple precision packages. Rmpfr is helpful:
library(Rmpfr)
x.precise <- mpfr(x, 100)
y.precise <- mpfr(y, 100)
mean(u(x.precise)) > mean(u(y.precise)) # True
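Since only the ranking matters, another base-R option (a sketch of my own, not from the original post) is to compare the expectations on the log scale with the log-sum-exp trick, which sidesteps the underflow and overflow entirely:
# E[u(X)] > E[u(Y)]  <=>  E[exp(-sigma*X)] < E[exp(-sigma*Y)], so compare
# log(mean(exp(-sigma*x))), computed stably by factoring out the largest term.
log_mean_exp <- function(v) {
  m <- max(v)
  m + log(mean(exp(v - m)))
}
sigma <- 1
set.seed(1)
x <- rnorm(10^4, 10^4, sd = 10^3)
y <- rnorm(10^4, 10^4, sd = 2 * 10^3)
log_mean_exp(-sigma * x) < log_mean_exp(-sigma * y)   # expected TRUE, i.e. E[u(X)] > E[u(Y)]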