Say I have a simple array, with a corresponding probability distribution.
library(stats)
data <- c(0,0.08,0.15,0.28,0.90)
pdf_of_data <- density(data, from= 0, to=1, bw=0.1)
Is there a way I could generate another set of data using the same distribution? Since the operation is probabilistic, the new set need not exactly match the initial distribution; it just needs to be generated from it.
I haven't had success finding a simple solution on my own. Thanks!
Your best bet is to generate the empirical cumulative distribution function (CDF), approximate the inverse, and then transform the input.
The compound expression looks like
random.points <- approx(
  cumsum(pdf_of_data$y) / sum(pdf_of_data$y),  # normalized cumulative sum = empirical CDF
  pdf_of_data$x,                               # swapping x and y approximates the inverse CDF
  runif(10000)                                 # uniform draws pushed through the inverse CDF
)$y
Plotting a histogram of the result:
hist(random.points, 100)
From the examples in the documentation of ?density you (almost) get the answer.
So, something like this should do it:
library("stats")
data <- c(0,0.08,0.15,0.28,0.90)
pdf_of_data <- density(data, from= 0, to=1, bw=0.1)
# From the example.
N <- 1e6
x.new <- rnorm(N, sample(data, size = N, replace = TRUE), pdf_of_data$bw)
# Histogram of the draws with the distribution superimposed.
hist(x.new, freq = FALSE)
lines(pdf_of_data)
You can just reject the draws outside your interval as in rejection sampling.
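For instance, a minimal sketch of that truncation for the [0, 1] interval from the question:
# keep only the draws that fall inside the support [0, 1]
x.new <- x.new[x.new >= 0 & x.new <= 1]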
Alternatively, you can use the algorithm described in the link.
To draw from the curve:
sample(pdf_of_data$x, 1e6, TRUE, pdf_of_data$y)
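Note that this returns only points on the estimate's grid; for continuous draws you can add kernel noise, as in the smearing approach above (a sketch):
draws <- sample(pdf_of_data$x, 1e6, TRUE, pdf_of_data$y) +
  rnorm(1e6, 0, pdf_of_data$bw)   # jitter each grid point by the kernel bandwidth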
I have no sample, and I'd like to compute the variance, mean, median, and mode of a distribution for which I only have a vector with its density and a vector with its support. Is there an easy way to compute these statistics in R from this information?
Suppose that I only have the following information:
Support
Density
sum(Density) == 1 # TRUE
length(Support) == length(Density) # TRUE
You have to do weighted summations. For example, starting from @Johann's example:
set.seed(312345)
x = rnorm(1000, mean=10, sd=1)
x_support = density(x)$x
x_density = density(x)$y
plot(x_support, x_density)
mean(x)
prints
[1] 10.00558
and what, I believe, you're looking for
m = weighted.mean(x_support, x_density)
computes the mean as the weighted mean of the support values, producing the output
10.0055796130192
There are weighted-statistics functions in contributed packages (for example, Hmisc provides wtd.var and wtd.quantile) which should help you with the other quantities you're looking for.
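For instance, a minimal sketch computing the remaining statistics directly from the support/density pair (assuming the density values are normalized to weights summing to one):
w <- x_density / sum(x_density)               # normalize density values to weights
m <- weighted.mean(x_support, w)              # mean
v <- sum(w * (x_support - m)^2)               # variance as weighted squared deviation
med <- x_support[which(cumsum(w) >= 0.5)[1]]  # median: first point past half the mass
mode_est <- x_support[which.max(w)]           # mode: support point of peak density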
If you don't need a mathematical solution, and an empirical one is all right, you can achieve a pretty good approximation by sampling.
Let's generate some data:
set.seed(6854684)
x = rnorm(50,mean=10,sd=1)
x_support = density(x)$x
x_density = density(x)$y
# see our example:
plot(x_support, x_density )
# the real mean of x
mean(x)
Now, to 'reverse' the process, we generate a large sample from that density:
x_sampled <- sample(x = x_support, 1000000, replace = TRUE, prob = x_density)
# get the statistics
mean(x_sampled)
median(x_sampled)
var(x_sampled)
etc...
I am applying the sensitivity package in R. In particular, I want to use sobolroalhs, as it uses a sampling procedure for inputs that allows evaluating models with a large number of parameters. The function samples uniformly on [0,1] for all inputs. The documentation states that desired distributions need to be obtained as follows:
####################
# Test case: dealing with non-uniform distributions
x <- sobolroalhs(model = NULL, factors = 3, N = 1000, order =1, nboot=0)
# X1 follows a log-normal distribution:
x$X[,1] <- qlnorm(x$X[,1])
# X2 follows a standard normal distribution:
x$X[,2] <- qnorm(x$X[,2])
# X3 follows a gamma distribution:
x$X[,3] <- qgamma(x$X[,3],shape=0.5)
# toy example
toy <- function(x){rowSums(x)}
y <- toy(x$X)
tell(x, y)
print(x)
plot(x)
I have non-zero means and standard deviations for some input parameters that I want to sample from a normal distribution. For others, I want to sample uniformly over a defined range (e.g. [0.03,0.07] instead of [0,1]). I tried using built-in R functions such as
SA$X[,1] <- rnorm(1000, mean = 579, sd = 21)
but I am afraid this procedure messes up the sampling design of the package; it gave odd results for the sensitivity indices. Hence, I think I need to keep the uniform draws of the sobolroalhs function and transform each sampled value in [0, 1] into a draw from the desired distribution (through its quantile function, I think?). Does this make sense to anyone, and/or does anyone know how I could sample from the right distributions following the syntax from the package description?
You can specify mean and sd in qnorm. So modify lines like this:
x$X[,2] <- qnorm(x$X[,2])
to something like this:
x$X[,2] <- qnorm(x$X[,2], mean = 579, sd = 21)
Similarly, you could use the min and max parameters of qunif to get values in a given range.
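For example, for the [0.03, 0.07] range mentioned in the question:
x$X[,3] <- qunif(x$X[,3], min = 0.03, max = 0.07)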
Of course, it's also possible to transform standard normals or uniforms to the ones you want using things like X <- 579 + 21*Z or Y <- 0.03 + 0.04*U, where Z is a standard normal and U is standard uniform, but for some distributions those transformations aren't so simple and using the q* functions can be easier.
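A sketch of those transformations with the numbers from the question (equivalent to the qnorm/qunif calls above):
x$X[,2] <- 579 + 21 * qnorm(x$X[,2])   # N(mean = 579, sd = 21)
x$X[,3] <- 0.03 + 0.04 * x$X[,3]       # uniform on [0.03, 0.07]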
I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom <- function(x, sd, L, R){
  R. <- (log(R) - log(x)) / sd
  L. <- (log(L) - log(x)) / sd
  ll <- pnorm(R.) - pnorm(L.)
  return(ll)
}
df <- data.frame(Range = seq(100, 500), sd = rep(0.1, 401), L = 200, U = 400)
df <- mutate(df, Likelihood = dcustom(Range, sd, L, U))  # U is passed as the R argument
with(df, plot(Range, Likelihood, type = 'l'))
abline(v = 200)
abline(v = 400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The function provides a large likelihood (1) if the model estimate x (a derived parameter) is within the L-R range, a smooth decrease (between 0 and 1) near the bounds (whose sharpness depends on the sd), and 0 if x is well outside the range.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not at all sure!) the steps are:
1. transform the likelihood function into a probability distribution
2. calculate the cumulative distribution function
3. apply inverse transform sampling
Is this correct, and if so, how do I code it? Thank you. (A rough sketch of these steps is given below.)
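A minimal sketch of the three steps on a grid (the grid range and resolution are arbitrary choices, not given above):
xs <- seq(100, 500, length.out = 1000)           # grid over the support
dens <- dcustom(xs, sd = 0.1, L = 200, R = 400)  # unnormalized density on the grid
cdf <- cumsum(dens) / sum(dens)                  # steps 1-2: normalize and accumulate
draws <- approx(cdf, xs, runif(10000), ties = "ordered")$y  # step 3: invert the CDF
hist(draws, 100)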
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution, given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
  lik <- dcustom(x = x, sd = 0.1, L = 200, R = 400)
  # propose a value for x (you can adjust the step size with the sd)
  x.proposed <- x + rnorm(1, 0, sd = 20)
  lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
  r <- lik.proposed / lik # this is the acceptance ratio
  # accept the new value with probability min(1, r)
  if (runif(1) < r) {
    x <- x.proposed
  }
  # record the current state either way: a rejected proposal must repeat
  # the current value for the chain to target the right distribution
  posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- posterior_sample
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
# five random draws from the approximate distribution
I have frequency values changing with time (x-axis units), as shown in the plot below. After some normalization, these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To avoid being misunderstood, I would like to add a little more explanation. By saying I have frequency values changing with time (x-axis units), I mean I have data which says that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
One way towards my goal (an incorrect one, I think) would be to create a set of these realizations:
# Loop to simulate the individual realizations
set.values <- c()
for(i in 1:length(sample)){
  set.values <- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr on the set.values:
f2 <- fitdistr(set.values, 'weibull')  # fitdistr() is in the MASS package
f2
Why do I think this is an incorrect way, and why am I looking for a better solution in R?
in the distribution-fitting approach presented above, it is assumed that set.values is a complete set of my realisations from the distribution T
in my original question, I know the points from the first part of the density curve - I do not know its tail, and I want to estimate the tail (and the whole density function)
Here is a better attempt; like before, it uses optim to find the best value, constrained to a box (defined by the lower and upper vectors in the optim call). Notice that it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have 3 parameters to optimize over.
Unfortunately, when using all the points it pretty much always finds something on the edges of the constraining box, which indicates to me that Weibull is maybe not a good fit for all of the data. The problem is the first two points - they are just too large. You can see the attempted fit to all the data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit, shown in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
# drop the first two (outsized) points from the fit
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
# scaled Weibull density: param[1] = shape, param[2] = y-scale, param[3] = x-scale
wx <- function(param) {
  param[2] * dweibull(t.fit * param[3], shape = param[1])
}
# objective: Euclidean norm of the residuals
minwx <- function(param){
  v <- s.fit - wx(param)
  sqrt(sum(v * v))
}
p0 <- c(1, 200, 1/20)
# box constraints require method "L-BFGS-B" in base optim
paramopt <- optim(p0, minwx, gr = NULL, method = "L-BFGS-B",
                  lower = c(0.1, 100, 0.01), upper = c(1.1, 5000, 1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f yscale:%.1f xscale:%.5f rms:%.1f", popt[1], popt[2], popt[3], rms)
plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)
You can directly calculate the maximum likelihood parameters, as described here.
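For reference, this is the standard frequency-weighted Weibull maximum-likelihood system that the k.diff function below encodes, with support points $x_i$ and frequencies $w_i$:

$$\frac{1}{\hat k} + \frac{\sum_i w_i \ln x_i}{\sum_i w_i} - \frac{\sum_i w_i\, x_i^{\hat k} \ln x_i}{\sum_i w_i\, x_i^{\hat k}} = 0, \qquad \hat\lambda = \left(\frac{\sum_i w_i\, x_i^{\hat k}}{\sum_i w_i}\right)^{1/\hat k}.$$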
# Defining the error of the implicit function (weights come in via vec)
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k * vec))
}
# Setting the error to "quite zero", fulfilling the equation
k <- optimize(k.diff, vec = sample, interval = c(0.1, 5), tol = 1e-7)$minimum
# Calculate lambda, given k (the MLE satisfies lambda^k = weighted mean of x^k)
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)
# Plot
plot(density(rep(seq(length(sample)),sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale= l))
Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
library(MASS)  # provides fitdistr
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f
If you are not sure whether the data are Weibull-distributed, I would recommend using ks.test. It tests whether your data come from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few selected distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution. (Keep in mind that estimating the parameters from the same data makes the test conservative.)
Update: histograms of either the Weibull or the exponential fit look like a good match to your data. I think the exponential distribution gives you a better fit; the Pareto distribution is another option.
f <- fitdistr(sample, 'weibull')
z <- rweibull(10000, shape = f$estimate[1], scale = f$estimate[2])
hist(z)
f <- fitdistr(sample, 'exponential')
z <- rexp(10000, f$estimate[1])
hist(z)
Let's say I have a set of numbers that I suspect come from the same distribution.
set.seed(20130613)
x <- rcauchy(10)
I would like a function that randomly generates a number from that same unknown distribution. One approach I have thought of is to create a density object, get the CDF from it, and take the inverse CDF of a random uniform variable (see Wikipedia).
den <- density(x)
#' Generate n random numbers from a density() object
#'
#' @param n The total random numbers to generate
#' @param den The density object from which to generate random numbers
rden <- function(n, den)
{
diffs <- diff(den$x)
# Making sure we have equal increments
stopifnot(all(abs(diff(den$x) - mean(diff(den$x))) < 1e-9))
total <- sum(den$y)
den$y <- den$y / total
ydistr <- cumsum(den$y)
yunif <- runif(n)
indices <- sapply(yunif, function(y) min(which(ydistr > y)))
x <- den$x[indices]
return(x)
}
rden(1, den)
## [1] -0.1854121
My questions are the following:
Is there a better (or built into R) way to generate a random number from a density object?
Are there any other ideas on how to generate a random number from a set of numbers (besides sample)?
To generate data from a density estimate, you just randomly choose one of the original data points and add a random "error" piece based on the kernel from the density estimate. For the default Gaussian kernel, this means choosing a random element from the original vector and adding a random normal with mean 0 and sd equal to the bandwidth used:
den <- density(x)
N <- 1000
newx <- sample(x, N, replace=TRUE) + rnorm(N, 0, den$bw)
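A quick visual check that the new draws track the estimate:
hist(newx, freq = FALSE)  # density-scaled histogram of the generated draws
lines(den)                # overlay the kernel density estimate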
Another option is to fit a density using the logspline function from the logspline package (uses a different method of estimating a density), then use the rlogspline function in that package to generate new data from the estimated density.
If all you need is to draw values from your existing pool of numbers, then sample is the way to go.
If you want to draw from the presumed underlying distribution, then use density, fit that to your presumed distribution to get the necessary coefficients (mean, sd, etc.), and use the appropriate R distribution function.
Beyond that, I'd take a look at Chapter 7.3 ("rejection method") of Numerical Recipes in C for ways to sample "selectively" according to any distribution. The code is simple enough to be easily translated into R.
My bet is someone already has done so and will post a better answer than this.
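In the meantime, here is a minimal rejection-sampling sketch against a density() estimate (the uniform proposal and the envelope constant M are choices of this sketch, not from the book):
den <- density(x)
M <- max(den$y)                                # envelope height for a uniform proposal
n <- 1000
out <- numeric(0)
while (length(out) < n) {
  cand <- runif(n, min(den$x), max(den$x))     # propose uniformly over the support
  f.cand <- approx(den$x, den$y, cand)$y       # density height at each candidate
  out <- c(out, cand[runif(n) * M < f.cand])   # accept with probability f(cand)/M
}
out <- out[1:n]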
Greg Snow's answer was helpful to me, and I realized that the output of the density function has all the data needed to create random numbers from the input distribution. Building on his example, you can do the following to get random values using the density output.
x <- rnorm(100) # or any numeric starting vector you desire
dens <- density(x)
N <- 1000
newx <- sample(x = dens$x, N, prob = dens$y, replace=TRUE) + rnorm(N, 0, dens$bw)
You can even create a simple random-number-generating function:
rdensity <- function(n, dens) {
return(sample(x = dens$x, n, prob = dens$y, replace=TRUE) + rnorm(n, 0, dens$bw))
}
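Usage, continuing the example above:
newx2 <- rdensity(1000, dens)
hist(newx2, freq = FALSE)  # check the draws against the estimate
lines(dens)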