R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but could not find precisely what I am looking for).
How can I draw one element from a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,1,10))
How do I similarly sample from e.g. the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or from a logistic probability function (if I want a sigmoid increasing probability from 1 through 10)?
The only "solution" I can come up with is first to draw e6 numbers from the say sigmoid distribution and then scale min and max to 1 and 10 - but this looks clumpsy.
UPDATE:
This awkward solution (and I don't feel it is very "correct") would go like this:
# Draw enough values from a distribution, here exponential
x <- rexp(1e3)
# Scale the values to e.g. 1-10
scaler <- function(vector, min, max){
  (((vector - min(vector)) * (max - min)) / (max(vector) - min(vector))) + min
}
x_scale <- scaler(x, 1, 10)
# And sample once (and round it)
round(sample(x_scale, 1))
Are there not better solutions around?

I believe sample() is what you are looking for, as @HubertL mentioned in the comments. You can apply an increasing function (such as the logistic function defined below) to the vector v you want to sample from, and use its output as the vector of probabilities p. See the code below.
logistic <- function(x) {
  exp(x) / (exp(x) + 1)   # the logistic (inverse-logit) function
}
v <- seq(1, 10, 1)
p <- logistic(v)          # increasing weights; sample() rescales them to probabilities
sample(v, 1, prob = p, replace = TRUE)
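The same pattern covers the exponential case from the question. As a minimal sketch (not part of the original answer), weighting each integer by the exponential density dexp() makes 1 much more likely than 10; the rate parameter here is an arbitrary choice:
v <- 1:10
p <- dexp(v, rate = 0.5)   # unnormalised decreasing weights; sample() rescales them
sample(v, 1, prob = p)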

Draw random numbers from distribution within a certain range

I want to draw a number of random variables from a series of distributions. However, the values returned have to be no higher than a certain threshold.
Let's say I want to use the gamma distribution, the threshold is 10, and I need n = 100 random numbers. I now want 100 random numbers between 0 and 10. (Say scale and shape are 1.)
Getting 100 random variables is obviously easy...
rgamma(100, shape = 1, rate = 1)
But how can I ensure that these values range from 0 to 10?
EDIT
To make my question clearer: the 100 values drawn should be scaled between 0 and 10, so that the highest drawn value is 10 and the lowest 0. Sorry if this was not clear...
EDIT No2
To add some context to the random numbers I need: I want to draw "system repair times" that follow certain distributions. However, within the system simulation there is a binomial probability of repairs being "simple" (i.e. short repair time) or "complicated" (i.e. long repair time). I now need a function that provides "short repair times" and one that provides "long repair times". The threshold would be the differentiation between short and long repair times. Again, I hope this makes my question a little clearer.
This is not possible with a gamma distribution.
The support of a distribution determines the range of sample data drawn from it. As the support of the gamma distribution is (0, Inf), this is not possible (see https://en.wikipedia.org/wiki/Gamma_distribution).
If you really want a gamma distribution, take a rejection sampling approach as Alex Reynolds suggests.
Otherwise, look for a distribution with a bounded/finite support (see https://en.wikipedia.org/wiki/List_of_probability_distributions), e.g. uniform or binomial.
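As a minimal sketch of the bounded-support route (not from the original answer), a scaled beta distribution already lives on a finite interval and can simply be stretched to (0, 10); the shape parameters here are arbitrary choices:
# 100 repair times in (0, 10): beta on (0, 1), stretched by the threshold
x <- 10 * rbeta(100, shape1 = 1, shape2 = 3)
range(x)   # all values fall inside (0, 10)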
Well, fill the vector with rejection sampling (untested code):
v <- rep(-1.0, 100)
k <- 1
while (TRUE) {
  q <- rgamma(1, shape = 1, rate = 1)
  if (q > 0.0 && q < 10) {   # keep only draws below the threshold of 10
    v[k] <- q
    k <- k + 1
    if (k > 100)
      break
  }
}
I'm not sure you can keep the properties of the original distribution while imposing additional conditions... But something like this will do the job:
Filter(function(x) x < 10, rgamma(1000, 1, 1))[1:100]
As for the scaling - beware, the outcome will not follow the original distribution (but there's no way around that, as the other answers pointed out):
# rescale a numeric vector into the (0, 1) interval,
# clipping everything outside the range
rescale <- function(vec, lims = range(vec), clip = c(0, 1)) {
  # find the coefficients of the transforming linear equation
  # that maps the lims range to (0, 1)
  slope <- (1 - 0) / (lims[2] - lims[1])
  intercept <- -slope * lims[1]
  xformed <- slope * vec + intercept
  # do the clipping
  xformed[xformed < 0] <- clip[1]
  xformed[xformed > 1] <- clip[2]
  xformed
}
# this is the requested data
10 * rescale(rgamma(100,1,1))
Use the truncdist package; it truncates any distribution between upper and lower bounds.
Hope that helps.
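A minimal sketch, assuming the rtrunc() interface of the truncdist package (not shown in the original answer):
library(truncdist)
# 100 draws from gamma(shape = 1, rate = 1), truncated to (0, 10)
x <- rtrunc(100, spec = "gamma", a = 0, b = 10, shape = 1, rate = 1)
range(x)   # all values fall inside (0, 10)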

Finding the Maximum of a Function with numerical derivatives in R

I wish to numerically find the maximum of the function that multiplies Beta 3 shown on p. 346 of the following link, when tau = 30:
http://www.ssc.upenn.edu/~fdiebold/papers/paper49/Diebold-Li.pdf
They give the answer on p347 as 0.0609.
I would like to confirm this numerically in R. I.e. to take the derivative and find the value where it reaches zero.
library(numDeriv)
x <- 30
testh <- function(lambda) {
  ((1 - exp(-lambda * 30)) / (lambda * 30)) - exp(-lambda * 30)
}
grad_h <- function(lambda) {
  val <- grad(testh, lambda)
  return(val^2)
}
OptLam <- optimize(f = grad_h, interval = c(0.0001, 120), tol = 0.0000000000001)
I take the square of the gradient as I want the minimum to be at zero.
Unfortunately, the answer comes back as Lambda=120!! With lambda at 120 the value of the objective function is 5.36e-12.
By working by hand I can find a lower value of the numerical derivative that is closer to zero (it is also close to the analytical value given above):
grad_h(0.05977604)
## [1] 4.24494e-12
Why is the function above not finding this lower value? I have set the tolerance very tight, so it should be able to find this optimal value.
Is it possible to correct the existing method so that it gives the correct answer?
Is there a better way to find the maximum gradient of a function numerically in R?
For example, is there an optimizer that looks for a zero rather than trying to find a minimum or maximum?
You can use uniroot to find where the derivative is 0. This might work for you:
grad_h <- function(lambda) {
  grad(testh, lambda)
}
## The root
res <- uniroot(grad_h, c(0, 120), tol = 1e-10)
## see it
ls <- seq(0.001, 1, length = 1000)
plot(ls, testh(ls), col = "salmon")
abline(v = res$root, col = "steelblue", lwd = 2, lty = 2)
text(x = res$root, y = testh(res$root),
     labels = sprintf("(%f, %s)", res$root,
                      format(testh(res$root), scientific = TRUE)),
     adj = -0.1)
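An alternative sketch, not from the original answer: since the goal is the maximum of testh itself, you can also skip the derivative entirely and ask optimize() for the maximum directly; the search interval here is an assumption based on the plot above.
opt <- optimize(testh, interval = c(0.001, 1), maximum = TRUE)
opt$maximum   # should be close to the analytical value of about 0.0609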

How do I calculate the probability for a given quantile in R?

Using R, it is trivial to calculate the quantiles for given probabilities in a sampled distribution:
x <- rnorm(1000, mean=4, sd=2)
quantile(x, .9) # results in 6.705755
However, I can't find an easy way to do the inverse—calculate the probability for a given quantile in the sample x. The closest I've come is to use pnorm() with the same mean and standard deviation I used when creating the sample:
pnorm(5, mean=4, sd=2) # results in 0.6914625
However, because this is calculating the probability from the full normal distribution, and not the sample x, it's not entirely accurate.
Is there a function that essentially does the inverse of quantile()? Something that essentially lets me do the same thing as pnorm() but with a sample? Something like this:
backwards_quantile(x, 5)
I've found the ecdf() function, but can't figure out a way to make it result in a single probability instead of a full equation object.
ecdf returns a function: you need to apply it.
f <- ecdf(x)
f( quantile(x,.91) )
# Equivalently:
ecdf(x)( quantile(x,.91) )
Just for convenience, this function helps:
quantInv <- function(distr, value) ecdf(distr)(value)
set.seed(1)
x <- rnorm(1000, mean=4, sd=2)
quantInv(x, c(4, 5, 6.705755))
[1] 0.518 0.685 0.904
You more or less have the answer yourself. When you want to write
backwards_quantile(x, 5)
just write
ecdf(x)(5)
This corresponds to the inverse of quantile() with type=1. However, if you want other types (I favour the NIST standard, corresponding to Excel's Percentile.exc, which is type=6), you have more work to do.
In these latter cases, consider what use you are going to put it to. If all you want is to plot it, for instance, then consider
yVals<-seq(0,1,0.01)
plot(quantile(x,yVals,type=6))
But if you want the inverse for a single value, like 5, then you need to write a solving function to find the P that makes
quantile(x,P,type=6) = 5
For instance this, which uses binary search between the extreme values of x:
inverse_quantile <- function(x, y, d = 0.01, type = 1) {
  A <- min(x)
  B <- max(x)
  k <- (log((B - A) / d) / log(2)) + 1
  P <- 0.5
  for (i in 1:k) {
    P <- P + ifelse(quantile(x, P, type = type) < y, 2^(-i - 1), -2^(-i - 1))
  }
  P
}
So if you wanted the probability corresponding to the value 5 in your set x, using type 4 quantiles with precision 0.00001, then you would call
inverse_quantile(x, 5, d = 0.00001, type = 4)
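A quick sanity check, as a sketch (not part of the original answer): the recovered probability should map back to roughly the target value.
set.seed(1)
x <- rnorm(1000, mean = 4, sd = 2)
P <- inverse_quantile(x, 5, d = 0.00001, type = 4)
quantile(x, P, type = 4)   # should be close to 5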

Root mean square deviation on binned GAM results using R

Background
A PostgreSQL database uses PL/R to call R functions. An R call to calculate Spearman's correlation looks as follows:
cor( rank(x), rank(y) )
Also in R, a naïve calculation of a fitted generalized additive model (GAM):
data.frame( x, fitted( gam( y ~ s(x) ) ) )
Here x represents the years from 1900 to 2009 and y is the average measurement (e.g., minimum temperature) for that year.
Problem
The fitted trend line (using GAM) is reasonably accurate, as you can see in the following picture:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
Possible Solution
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
Questions
Q.1. How would you implement the RMSE calculation on the binned data to get a correlation (between 0 and 1) of GAM's fit to the measurements, in the R language?
Q.2. Is there a better way to find the accuracy of GAM's fit to the data, and if so, what is it (e.g., root mean square deviation)?
Attempted Solution 1
Call the PL/R function using the observed amounts and the model (GAM) amounts:
correlation_rmse := climate.plr_corr_rmse( v_amount, v_model );
Define plr_corr_rmse as follows (where o and m represent the observed and modelled data):
CREATE OR REPLACE FUNCTION climate.plr_corr_rmse(
  o double precision[], m double precision[])
RETURNS double precision AS
$BODY$
  sqrt( mean( o - m ) ^ 2 )
$BODY$
LANGUAGE 'plr' VOLATILE STRICT
COST 100;
The o - m is wrong. I'd like to bin both data sets by calculating the mean of every 5 data points (there will be at most 110 data points). For example:
omean <- c( mean(o[1:5]), mean(o[6:10]), ... )
mmean <- c( mean(m[1:5]), mean(m[6:10]), ... )
Then correct the RMSE calculation as:
sqrt( mean( omean - mmean ) ^ 2 )
How do you calculate c( mean(o[1:5]), mean(o[6:10]), ... ) for an arbitrary length vector in an appropriate number of bins (5, for example, might not be ideal for only 67 measurements)?
I don't think hist is suitable here, is it?
Attempted Solution 2
The following code will solve the problem; however, it drops data points from the end of the list (to make its length divisible by 5). The solution isn't ideal, as the number "5" is rather magical.
while (length(o) %% 5 != 0) {
  o <- o[-length(o)]
}
omean <- apply(matrix(o, 5), 2, mean)
What other options are available?
Thanks in advance.
You say that:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
You could calculate the correlation between the fitted values and the measured values:
cor(y,fitted(gam(y ~ s(x))))
I don't see why you want to bin your data, but you could do it as follows:
mean.binned <- function(y, n = 5){
  # pad y with NAs up to a multiple of n, reshape into n rows,
  # then take the mean of each column (bin), ignoring the padding
  apply(matrix(c(y, rep(NA, (n - (length(y) %% n)) %% n)), n),
        2,
        function(x) mean(x, na.rm = TRUE))
}
It looks a bit ugly, but it should handle vectors whose length is not a multiple of the binning length (i.e. 5 in your example).
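As a hedged sketch tying this back to Q.1 (the data and the call to mgcv's gam() here are illustrative assumptions, not from the original post): bin the observed and fitted values the same way, then take the RMSE of the binned means.
library(mgcv)   # assuming gam() comes from mgcv
set.seed(42)
x <- 1900:2009
y <- 0.01 * (x - 1900) + rnorm(length(x))   # made-up yearly measurements
m <- fitted(gam(y ~ s(x)))

obinned <- mean.binned(y, n = 5)
mbinned <- mean.binned(m, n = 5)
sqrt(mean((obinned - mbinned)^2))   # RMSE on the binned data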
You also say that:
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
I don't understand what you mean by this. The correlation is a factor in determining the mean squared error - for example, see equation 10 of Murphy (1988, Monthly Weather Review, v. 116, pp. 2417-2424). But please explain what you mean.
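For context, the link between correlation and MSE that the answer alludes to is the standard bias-variance-covariance decomposition of the MSE (the kind of relationship Murphy discusses). A small numerical check, with o and m standing for illustrative observed and modelled vectors (not data from the original post):
set.seed(1)
o <- rnorm(100)                       # observed (illustrative)
m <- 0.8 * o + rnorm(100, sd = 0.3)   # modelled (illustrative)

n <- length(o)
mse <- mean((m - o)^2)
# decomposition: squared bias + variances - 2 * covariance
# (population moments, hence the (n - 1) / n rescaling of var/cov)
decomp <- (mean(m) - mean(o))^2 +
  (n - 1) / n * (var(m) + var(o) - 2 * cov(m, o))
all.equal(mse, decomp)   # TRUE: correlation enters through cov(m, o)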
