Convert uniform draws to normal distributions with known mean and std in R - r

I apply the sensitivity package in R. In particular, I want to use sobolroalhs as it uses a sampling procedure for inputs that allow for evaluations of models with a large number of parameters. The function samples uniformly [0,1] for all inputs. It is stated that desired distributions need to be obtained as follows
####################
# Test case: dealing with non-uniform distributions
x <- sobolroalhs(model = NULL, factors = 3, N = 1000, order =1, nboot=0)
# X1 follows a log-normal distribution:
x$X[,1] <- qlnorm(x$X[,1])
# X2 follows a standard normal distribution:
x$X[,2] <- qnorm(x$X[,2])
# X3 follows a gamma distribution:
x$X[,3] <- qgamma(x$X[,3],shape=0.5)
# toy example
toy <- function(x){rowSums(x)}
y <- toy(x$X)
tell(x, y)
print(x)
plot(x)
I have non-zero mean and standard deviations for some input parameter that I want to sample out of a normal distribution. For others, I want to uniformly sample between a defined range (e.g. [0.03,0.07] instead [0,1]). I tried using built in R functions such as
SA$X[,1] <- rnorm(1000, mean = 579, sd = 21)
but I am afraid this procedure messes up the sampling design of the package and resulted in odd results for the sensitivity indices. Hence, I think I need to adhere for the uniform draw of the sobolroalhs function in which and use the sampled value between [0, 1] when drawing out of the desired distribution (I think as density draw?). Does this make sense to anyone and/or does anyone know how I could sample out of the right distributions following the syntax from the package description?

You can specify mean and sd in qnorm. So modify lines like this:
x$X[,2] <- qnorm(x$X[,2])
to something like this:
x$X[,2] <- qnorm(x$X[,2], mean = 579, sd = 21)
Similarly, you could use the min and max parameters of qunif to get values in a given range.
Of course, it's also possible to transform standard normals or uniforms to the ones you want using things like X <- 579 + 21*Z or Y <- 0.03 + 0.04*U, where Z is a standard normal and U is standard uniform, but for some distributions those transformations aren't so simple and using the q* functions can be easier.

Related

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
R. = (log(R) - log(x))/sd
L. = (log(L) - log(x))/sd
ll = pnorm(R.) - pnorm(L.)
return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis Hasting Algorithm to obtain a sample from the distribution given all the other parameters and your likelihood.
# metropolis hasting algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
lik <- dcustom(x = x, sd = 0.1, L = 200, R =400)
# propose a value for x (you can adjust the stepsize with the sd)
x.proposed <- x + rnorm(1, 0, sd = 20)
lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
r <- lik.proposed/lik # this is the acceptance ratio
# accept new value with probablity of ratio
if (runif(1) < r) {
x <- x.proposed
posterior_sample[i] <- x
}
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302

Find out which percentile a number has [duplicate]

Using R, it is trivial to calculate the quantiles for given probabilities in a sampled distribution:
x <- rnorm(1000, mean=4, sd=2)
quantile(x, .9) # results in 6.705755
However, I can't find an easy way to do the inverseā€”calculate the probability for a given quantile in the sample x. The closest I've come is to use pnorm() with the same mean and standard deviation I used when creating the sample:
pnorm(5, mean=4, sd=2) # results in 0.6914625
However, because this is calculating the probability from the full normal distribution, and not the sample x, it's not entirely accurate.
Is there a function that essentially does the inverse of quantile()? Something that essentially lets me do the same thing as pnorm() but with a sample? Something like this:
backwards_quantile(x, 5)
I've found the ecdf() function, but can't figure out a way to make it result in a single probability instead of a full equation object.
ecdf returns a function: you need to apply it.
f <- ecdf(x)
f( quantile(x,.91) )
# Equivalently:
ecdf(x)( quantile(x,.91) )
Just for convenience, this function helps:
quantInv <- function(distr, value) ecdf(distr)(value)
set.seed(1)
x <- rnorm(1000, mean=4, sd=2)
quantInv(x, c(4, 5, 6.705755))
[1] 0.518 0.685 0.904
You more or less have the answer yourself. When you want to write
backwards_quantile(x, 5)
just write
ecdf(x)(5)
This corresponds to the inverse of quantile() with type=1. However, if you want other types (I favour the NIST standard, corresponding to Excel's Percentile.exc, which is type=6), you have more work to do.
In these latter cases, consider which use you are going to put it to. If all you want is to plot it, for instance, then consider
yVals<-seq(0,1,0.01)
plot(quantile(x,yVals,type=6))
But if you want the inverse for a single value, like 5, then you need to write a solving function to find the P that makes
quantile(x,P,type=6) = 5
For instance this, which uses binary search between the extreme values of x:
inverse_quantile<-function(x,y,d=0.01,type=1) {
A<-min(x)
B<-max(x)
k<-(log((B-A)/d)/log(2))+1
P=0.5
for (i in 1:k) {
P=P+ifelse((quantile(x,P,type=type)<y),2^{-i-1},-2^{-i-1})
}
P
}
So if you wanted the type 4 quantile of your set x for the number 5, with precision 0.00001, then you would write
inverse_quantile<-function(x,5,d=0.00001,type=4)

Fit distribution to given frequency values in R

I have frequency values changing with the time (x axis units), as presented on the picture below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points are from Weibull distribution T, how can I fit best Weibull density function to the points so as to infer the distribution T parameters from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To prevent from being misunderstood, I would like to add little more explanation. By saying I have frequency values changing with the time (x axis units) I mean I have data which says that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
Some way towards my goal (incorrect one, as I think) would be to create a set of these realizations:
# Loop to simulate values
set.values <- c()
for(i in 1:length(sample)){
set.values <<- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr on the set.values:
f2 <- fitdistr(set.values, 'weibull')
f2
Why I think it is incorrect way and why I am looking for a better solution in R?
in the distribution fitting approach presented above it is assumed that set.values is a complete set of my realisations from the distribution T
in my original question I know the points from the first part of the density curve - I do not know its tail and I want to estimate the tail (and the whole density function)
Here is a better attempt, like before it uses optim to find the best value constrained to a set of values in a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization in addition to the Weibull distribution shape parameter, so we have 3 parameters to optimize over.
Unfortunately when using all the points it pretty much always finds something on the edges of the constraining box which indicates to me that maybe Weibull is maybe not a good fit for all of the data. The problem is the two points - they ares just too large. You see the attempted fit to all data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit. You see this in the second plot. I think this is a good fit, it is in any case a local minimum in the interior of the constraining box.
library(optimx)
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
wx <- function(param) {
res <- param[2]*dweibull(t.fit*param[3],shape=param[1])
return(res)
}
minwx <- function(param){
v <- s.fit-wx(param)
sqrt(sum(v*v))
}
p0 <- c(1,200,1/20)
paramopt <- optim(p0,minwx,gr=NULL,lower=c(0.1,100,0.01),upper=c(1.1,5000,1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f xscale:%.1f yscale:%.5f rms:%.1f",popt[1],popt[2],popt[3],rms)
plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)
You can directly calculate the maximum likelihood parameters, as described here.
# Defining the error of the implicit function
k.diff <- function(k, vec){
x2 <- seq(length(vec))
abs(k^-1+weighted.mean(log(x2), w = sample)-weighted.mean(log(x2),
w = x2^k*sample))
}
# Setting the error to "quite zero", fulfilling the equation
k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min
# Calculate lambda, given k
l <- weighted.mean(seq(length(sample))^k, w = sample)
# Plot
plot(density(rep(seq(length(sample)),sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale= l))
Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
f<-fitdistr(sample, 'weibull')
f
If you are not sure whether it is distributed Weibull, I would recommend using the ks.test. This tests whether your data is from a hypothesised distribution. Given your knowledge of the nature of the data, you could test for a few selected distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is insignificant, hence you do not reject the hypothesis that the data is from a Weibull distribution.
Update: The histograms of either the Weibull or exponential look like a good match to your data. I think the exponential distribution gives you a better fit. Pareto distribution is another option.
f<-fitdistr(sample, 'weibull')
z<-rweibull(10000, shape= f$estimate[1],scale= f$estimate[2])
hist(z)
f<-fitdistr(sample, 'exponential')
z = rexp(10000, f$estimate[1])
hist(z)

Generate a random number from a density object (or more broadly from a set of numbers)

Let's say I have a set of numbers that I suspect come from the same distribution.
set.seed(20130613)
x <- rcauchy(10)
I would like a function that randomly generates a number from that same unknown distribution. One approach I have thought of is to create a density object and then get the CDF from that and take the inverse CDF of a random uniform variable (see Wikipedia).
den <- density(x)
#' Generate n random numbers from density() object
#'
#' #param n The total random numbers to generate
#' #param den The density object from which to generate random numbers
rden <- function(n, den)
{
diffs <- diff(den$x)
# Making sure we have equal increments
stopifnot(all(abs(diff(den$x) - mean(diff(den$x))) < 1e-9))
total <- sum(den$y)
den$y <- den$y / total
ydistr <- cumsum(den$y)
yunif <- runif(n)
indices <- sapply(yunif, function(y) min(which(ydistr > y)))
x <- den$x[indices]
return(x)
}
rden(1, den)
## [1] -0.1854121
My questions are the following:
Is there a better (or built into R) way to generate a random number from a density object?
Are there any other ideas on how to generate a random number from a set of numbers (besides sample)?
To generate data from a density estimate you just randomly choose one of the original data points and add a random "error" piece based on the kernel from the density estimate, for the default of "Gaussian" this just means choose a random element from the original vector and add a random normal with mean 0 and sd equal to the bandwidth used:
den <- density(x)
N <- 1000
newx <- sample(x, N, replace=TRUE) + rnorm(N, 0, den$bw)
Another option is to fit a density using the logspline function from the logspline package (uses a different method of estimating a density), then use the rlogspline function in that package to generate new data from the estimated density.
If all you need is to draw values from your existing pool of numbers, then sample is the way to go.
If you want to draw from the presumed underlying distribution, then use density , and fit that to your presumed distribution to get the necessary coefficients (mean, sd, etc.), and use the appropriate R distribution function.
Beyond that, I'd take a look at Chapter7.3 ("rejection method") of Numerical Recipes in C for ways to "selectively" sample according to any distribution. The code is simple enough to be easily translated into R .
My bet is someone already has done so and will post a better answer than this.
Greg Snow's answer was helpful to me, and I realized that the output of the density function has all the data needed to create random numbers from the input distribution. Building on his example, you can do the following to get random values using the density output.
x <- rnorm(100) # or any numeric starting vector you desire
dens <- density(x)
N <- 1000
newx <- sample(x = dens$x, N, prob = dens$y, replace=TRUE) + rnorm(N, 0, dens$bw)
You can even create a simple random number generating function
rdensity <- function(n, dens) {
return(sample(x = dens$x, n, prob = dens$y, replace=TRUE) + rnorm(n, 0, dens$bw))
}

How do I calculate the probability for a given quantile in R?

Using R, it is trivial to calculate the quantiles for given probabilities in a sampled distribution:
x <- rnorm(1000, mean=4, sd=2)
quantile(x, .9) # results in 6.705755
However, I can't find an easy way to do the inverseā€”calculate the probability for a given quantile in the sample x. The closest I've come is to use pnorm() with the same mean and standard deviation I used when creating the sample:
pnorm(5, mean=4, sd=2) # results in 0.6914625
However, because this is calculating the probability from the full normal distribution, and not the sample x, it's not entirely accurate.
Is there a function that essentially does the inverse of quantile()? Something that essentially lets me do the same thing as pnorm() but with a sample? Something like this:
backwards_quantile(x, 5)
I've found the ecdf() function, but can't figure out a way to make it result in a single probability instead of a full equation object.
ecdf returns a function: you need to apply it.
f <- ecdf(x)
f( quantile(x,.91) )
# Equivalently:
ecdf(x)( quantile(x,.91) )
Just for convenience, this function helps:
quantInv <- function(distr, value) ecdf(distr)(value)
set.seed(1)
x <- rnorm(1000, mean=4, sd=2)
quantInv(x, c(4, 5, 6.705755))
[1] 0.518 0.685 0.904
You more or less have the answer yourself. When you want to write
backwards_quantile(x, 5)
just write
ecdf(x)(5)
This corresponds to the inverse of quantile() with type=1. However, if you want other types (I favour the NIST standard, corresponding to Excel's Percentile.exc, which is type=6), you have more work to do.
In these latter cases, consider which use you are going to put it to. If all you want is to plot it, for instance, then consider
yVals<-seq(0,1,0.01)
plot(quantile(x,yVals,type=6))
But if you want the inverse for a single value, like 5, then you need to write a solving function to find the P that makes
quantile(x,P,type=6) = 5
For instance this, which uses binary search between the extreme values of x:
inverse_quantile<-function(x,y,d=0.01,type=1) {
A<-min(x)
B<-max(x)
k<-(log((B-A)/d)/log(2))+1
P=0.5
for (i in 1:k) {
P=P+ifelse((quantile(x,P,type=type)<y),2^{-i-1},-2^{-i-1})
}
P
}
So if you wanted the type 4 quantile of your set x for the number 5, with precision 0.00001, then you would write
inverse_quantile<-function(x,5,d=0.00001,type=4)

Resources