Simulate from an (arbitrary) continuous probability distribution [duplicate] - r

This question already has answers here:
How do I best simulate an arbitrary univariate random variate using its probability function?
(4 answers)
Closed 8 years ago.
For a normalized probability density function defined on the real line, for example
p(x) = (2/pi) * (1/(exp(x)+exp(-x)))
(this is just an example; the solution should apply to any continuous PDF we can define), is there a package in R to simulate from the distribution? I am aware of R's built-in simulators for many standard distributions.
I could numerically compute the inverse cumulative distribution function at a set of quantiles, store them in a table, and use the table to map from uniform variates to variates from the desired distribution. Is there already a package that does this?

Here is a way using the distr package, which is designed for this.
library(distr)
p <- function(x) (2/pi) * (1/(exp(x)+exp(-x))) # probability density function
dist <- AbscontDistribution(d=p) # define an absolutely continuous distribution with density p
rdist <- r(dist) # function to create random variates from p
set.seed(1) # for a reproducible example
X <- rdist(1000) # sample from X ~ p
x <- seq(-10,10, .01)
hist(X, freq=F, breaks=50, xlim=c(-5,5))
lines(x,p(x),lty=2, col="red")
You can of course also do this in base R using the methodology described in any one of the links in the comments.
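For instance, here is a minimal base-R sketch of the inverse-CDF table idea described in the question (assuming, for this density, that the support can safely be truncated to [-10, 10]):
p <- function(x) (2/pi) * (1/(exp(x) + exp(-x)))
grid <- seq(-10, 10, by = 0.01)            # truncated support
cdf <- cumsum(p(grid)) * 0.01              # crude Riemann-sum approximation of the CDF
cdf <- cdf / max(cdf)                      # renormalize to absorb the truncation error
inv_cdf <- approxfun(cdf, grid, rule = 2)  # table lookup: uniform u -> quantile
set.seed(1)
X <- inv_cdf(runif(1000))                  # inverse-transform sampling
hist(X, freq = FALSE, breaks = 50, xlim = c(-5, 5))
lines(grid, p(grid), lty = 2, col = "red")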

If this is the function that you're dealing with, you could just take the integral to get the CDF and then invert it (or, if you're rusty on your integration rules like me, you could use a tool like Wolfram Alpha to do it for you).
In the case of the function provided, integrating the density gives F(x) = (2/pi) * atan(exp(x)); solving u = F(x) for x gives x = log(tan(pi*u/2)), so you can simulate with:
draw.val <- function(numdraw) log(tan(pi*runif(numdraw)/2))
A histogram confirms that we're sampling correctly:
hist(draw.val(10000), breaks=100, probability=T)
x <- seq(-10, 10, .001)
lines(x, (2/pi) * (1/(exp(x)+exp(-x))), col="red")

Related

Empirical CDF vs Theoretical CDF in R

I want to check the "probability integral transform" theorem using R.
Let's suppose X is an exponential random variable with lambda = 5.
I want to check that the random variable U = F_X(X) = 1 - exp(-5*X) has a Uniform(0,1) distribution.
How would you do it?
I would start in this way:
nsample <- 1000
lambda <- 5
x <- rexp(nsample, lambda) # 1000 exponential observations
u <- 1 - exp(-lambda*x) # CDF evaluated at x
Then I need to find the CDF of u and compare it with the CDF of a Uniform (0,1).
For the empirical CDF of u I could use the ecdf() function:
ECDF_u <- ecdf(u) #empirical CDF of U
Now I should create the theoretical CDF of Uniform (0,1) and plot it on the same graph of the ECDF in order to compare the two graphs.
Can you help with the code?
You are almost there. You don't need to compute the ECDF yourself – qqplot will take care of this. All you need is your sample (u) and data from the distribution you want to check against. The lazy (and not quite correct) approach would be to check against a random sample drawn from a uniform distribution:
qqplot(runif(nsample), u)
But of course, it is better to plot against the theoretical quantiles:
# the actual plot
qqplot( qunif(ppoints(length(u))), u )
# add a line
qqline(u, distribution=qunif, col='red', lwd=2)
Looks pretty good to me.
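If you also want the plot the question describes (the ECDF of u overlaid with the theoretical Uniform(0,1) CDF), a minimal sketch reusing the ECDF_u object could be:
plot(ECDF_u, main = "ECDF of u vs. Uniform(0,1) CDF")
curve(punif(x), from = 0, to = 1, add = TRUE, col = "red", lwd = 2)
ks.test(u, "punif")   # optional: a formal Kolmogorov-Smirnov check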

chi-square distribution R

Trying to fit a chi_square distribution using fitdistr() in R. Documentation on this is here (and not very useful to me): https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
Question 1: chi_df below has the following output: 3.85546875 (0.07695236). What is the second number? The Variance or standard deviation?
Question 2: fitdistr generates 'k' defined by the Chi-SQ distribution. How do I fit the data so I get the scaling constant 'A'? I am dumbly using lines 14-17 below. Obviously not good.
Question 3: Is the Chi-SQ distribution only defined for a certain x-range? (Variance is defined as 2K, while mean = k. This must require some constrained x-range... Stats question not programming...)
nnn = 1000;
## Generating a chi-sq distribution
chii <- rchisq(nnn,4, ncp = 0);
## Plotting Histogram
chi_hist <- hist(chii);
## Fitting. Gives probability density which must be scaled.
chi_df <- fitdistr(chii,"chi-squared",start=list(df=3));
chi_k <- chi_df[[1]][1];
## Plotting a fitted line:
## Spanning x-length of chi-sq data
x_chi_fit <- 1:nnn*((max(chi_hist[[1]][])-min(chi_hist[[1]][]))/nnn);
## Y data using eqn for probability function
y_chi_fit <- (1/(2^(chi_k/2)*gamma(chi_k/2)) * x_chi_fit^(chi_k/2-1) * exp(-x_chi_fit/2));
## Normalizing to the peak of the histogram
y_chi_fit <- y_chi_fit*(max(chi_hist[[2]][]/max(y_chi_fit)));
## Plotting the line
lines(x_chi_fit,y_chi_fit,lwd=2,col="green");
Thanks for your help!
As commented above, ?fitdistr says
An object of class ‘"fitdistr"’, a list with four components,
...
sd: the estimated standard errors,
... so that parenthesized number is the standard error of the parameter.
The scale parameter doesn't need to be estimated; you need either to scale by the width of your histogram bins or just use freq=FALSE when drawing your histogram. See code below.
The chi-squared distribution is defined on the non-negative reals, which makes sense since it's the distribution of a squared standard Normal (this is a statistical, not a programming question).
Set up data:
nnn <- 1000
## ensure reproducibility; not a big deal in this case,
## but good practice
set.seed(101)
## Generating a chi-sq distribution
chii <- rchisq(nnn,4, ncp = 0)
Fitting:
library(MASS)
## use method="Brent" based on warning
chi_df <- fitdistr(chii, "chi-squared", start=list(df=3),
                   method="Brent", lower=0.1, upper=100)
chi_k <- chi_df[[1]][1]
(For what it's worth, it looks like there might be a bug in the print method for fitdistr when method="Brent" is used. You could also use method="BFGS" and wouldn't need to specify bounds ...)
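A quick sketch of that alternative, assuming BFGS converges without bounds here:
chi_df_bfgs <- fitdistr(chii, "chi-squared", start=list(df=3), method="BFGS")
chi_df_bfgs$estimate   # should be close to the Brent estimate above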
Histograms:
chi_hist <- hist(chii,breaks=50,col="gray")
## scale by N and width of histogram bins
curve(dchisq(x, df=chi_k) * nnn * diff(chi_hist$breaks)[1],
      add=TRUE, col="green")
## or plot histogram already scaled to a density
chi_hist <- hist(chii,breaks=50,col="gray",freq=FALSE)
curve(dchisq(x,df=chi_k),add=TRUE,col="green")
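As a quick sanity check on the fit (the sample was generated with df = 4):
chi_k                                                 # fitted df, should be near 4
curve(dchisq(x, df=4), add=TRUE, col="blue", lty=2)   # true density for comparison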

Plot density of a distribution

Is it possible to plot in R the density function of a distribution?
For example suppose that I want to plot the density function of a Normal(0,5) or a Gamma(5,5).
f <- function(x) dnorm(x,0,5)
g <- function(x) dgamma(x,5,5)
par(mfrow=c(1,2)) # set up graphics window for 2 plots
plot(f,xlim=c(-15,15),main="N(0,5)")
plot(g,xlim=c(0,3),main="Gamma(5,5)")
In R the distribution functions follow a pattern. For instance, for the normal distribution, the PDF is dnorm(...), the CDF is pnorm(...), the inverse CDF (quantile function) is qnorm(...), and the random number generator is rnorm(...).
One thing to watch out for is that R's argument conventions do not necessarily match what you find, for instance, on Wikipedia. For example, the arguments to dgamma(...) are x, shape, and rate (with scale = 1/rate available as an alternative), not x, k, and theta.
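To make both points concrete, a small sketch:
## the d/p/q/r pattern, using the Normal(0,5) example from above
dnorm(0, mean=0, sd=5)    # density at 0
pnorm(0, mean=0, sd=5)    # CDF at 0 (= 0.5)
qnorm(0.5, mean=0, sd=5)  # inverse CDF / quantile (= 0)
rnorm(3, mean=0, sd=5)    # three random draws
## dgamma takes shape and rate (or scale = 1/rate), not Wikipedia's k and theta
dgamma(1, shape=5, rate=5)
dgamma(1, shape=5, scale=1/5)  # same value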

How to get Inverse CDF (kernel) in R?

Is there any function in R which will calculate the inverse of a kernel density CDF (I am considering a normal kernel) for a particular alpha in (0,1)?
I have found quantile but I am not sure how it works.
Thanks
We can integrate to get the cdf and we can use a root finding algorithm to invert the cdf. First we'll want to interpolate the output from density.
set.seed(10000)
x <- rnorm(1000, 10, 13)
pdf <- density(x)
# Interpolate the density
f <- approxfun(pdf$x, pdf$y, yleft=0, yright=0)
# Get the cdf by numeric integration
cdf <- function(x){
  integrate(f, -Inf, x)$value
}
# Use a root finding function to invert the cdf
invcdf <- function(q){
  uniroot(function(x){cdf(x) - q}, range(x))$root
}
which gives
med <- invcdf(.5)
cdf(med)
#[1] 0.5000007
This could definitely be improved upon. One issue is that I don't guarantee that the cdf is always less than or equal to 1 (and if you check the cdf for values larger than max(x) you might get something like 1.00097). But I'm too tired to fix that now; this should give a decent start.
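One simple tweak, if the slight overshoot matters for your use case, is to clamp the numerical CDF to [0, 1] (a sketch, not a careful fix):
cdf_clamped <- function(x){
  min(1, max(0, integrate(f, -Inf, x)$value))
}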
An alternative approach would be to use log-spline density estimation rather than kernel density estimation. Look at the 'logspline' package. With logspline density estimations you get CDF (plogspline) and inverse CDF (qlogspline) functions.
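A minimal sketch of that approach, assuming the logspline package is installed:
library(logspline)
fit <- logspline(x)                     # log-spline density estimate of the same sample
qlogspline(0.5, fit)                    # inverse CDF at 0.5 (estimated median)
plogspline(qlogspline(0.5, fit), fit)   # should be approximately 0.5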

Plotting fitted values vs observed ones in R or winbugs

I want to plot the fitted values versus the observed ones and add a straight line showing the goodness of fit. However, I do not want to use abline() because I did not calculate the fitted values using the lm command; I used a model that R does not cover, so I calculated the coefficients myself and used them to compute the fitted values. What can I do to obtain such a plot in R or in WinBUGS?
Here is what I want
Still no data provided, but maybe this simple example using the curve function will inform the process:
x <- 1:10
y <- 2+ 3*(1:10) + rnorm(10)
plot(1:10, y)
curve( 2+3*x, 0, 10, add=TRUE)
Note to new R users: the expression y_i = 1 - xbeta + delta_i + e_i would fail in R, in part because x and beta are not separated by an operator. But if you do understand R's matrix syntax, it can be a very compact expression even if X is multidimensional (see the sketch below). All of this depends on the specifics, which we are so far lacking.
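For illustration, a hypothetical sketch of how compact that expression becomes once X is a matrix and beta a vector (all names and values here are made up):
set.seed(1)
n <- 10; k <- 3
X <- matrix(rnorm(n * k), n, k)    # hypothetical n x k design matrix
beta <- c(2, -1, 0.5)              # hypothetical coefficient vector
delta <- rnorm(n, sd = 0.1)        # hypothetical random effects
e <- rnorm(n)                      # noise
y <- 1 - X %*% beta + delta + e    # the whole model in one vectorized line
fitted <- 1 - X %*% beta           # fitted values from the same coefficients
plot(fitted, y)
abline(0, 1)                       # y = x reference line (no lm object needed)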
