Generate stochastic random deviates from a density object with R

I have a density object dd created like this:
x1 <- rnorm(1000)
x2 <- rnorm(1000, 3, 2)
x <- rbind(x1, x2)
dd <- density(x)
plot(dd)
Which produces this very non-Gaussian distribution:
[Image: http://www.cerebralmastication.com/wp-content/uploads/2009/09/nongaus.png]
I would ultimately like to get random deviates from this distribution similar to how rnorm gets deviates from a normal distribution.
The way I am trying to crack this is to get the CDF of my kernel and then get it to tell me the variate if I pass it a cumulative probability (inverse CDF). That way I can turn a vector of uniform random variates into draws from the density.
It seems like what I am trying to do should be something basic that others have done before me. Is there a simple way or a simple function to do this? I hate reinventing the wheel.
FWIW I found this R Help article but I can't grok what they are doing and the final output does not seem to produce what I am after. But it could be a step along the way that I just don't understand.
I've considered just going with a Johnson distribution from the SuppDists package, but a Johnson distribution won't give me the nice bimodal hump that my data has.

Alternative approach:
sample(x, n, replace = TRUE)
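Resampling the data alone only ever returns values that were actually observed. To draw from the kernel density estimate itself, one option is the smoothed bootstrap: resample the data, then add noise drawn from the kernel. A minimal sketch, assuming the default Gaussian kernel of density() (for which dd$bw is the kernel standard deviation); the helper name rdens is made up for illustration:

```r
set.seed(1)
x <- c(rnorm(1000), rnorm(1000, 3, 2))
dd <- density(x)  # Gaussian kernel by default

# Hypothetical helper: draw n values from the kernel density estimate
# by resampling the data and adding Gaussian noise with sd = bandwidth
rdens <- function(n, data, bw) {
  sample(data, n, replace = TRUE) + rnorm(n, mean = 0, sd = bw)
}

draws <- rdens(10000, x, dd$bw)
summary(draws)
```

Overlaying lines(density(draws)) on plot(dd) should give two roughly matching curves.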

This is just a mixture of normals. So why not something like:
rmnorm <- function(n, mean, sd, prob) {
  nmix <- length(mean)
  if (length(sd) != nmix) stop("lengths should be the same.")
  y <- sample(1:nmix, n, prob = prob, replace = TRUE)
  mean.mix <- mean[y]
  sd.mix <- sd[y]
  rnorm(n, mean.mix, sd.mix)
}
plot(density(rmnorm(10000,mean=c(0,3), sd=c(1,2), prob=c(.5,.5))))
This should be fine if all you need are samples from this mixture distribution.

Related

Generate random data from arbitrary CDF in R?

I have an arbitrary CDF that is applied to a point estimate. I have a number of these point estimates with associated CDFs, that I need to simulate random data for a Monte Carlo simulation.
I generate the CDF by doing a spline fit to the arbitrary points provided in a table. For example, the 0.1 quantile is 0.13 * point estimate, and the 0.9 quantile is 7.57 * point estimate. It is fairly crude and is based on a large study comparing these models to real-world systems -- ignore that for now please.
I fit the CDF using a spline fit as shown here.
If I take the derivative of this, I get the shape of the pdf (image).
I modified the function "samplepdf" found here, Sampling from an Arbitrary Density, as follows:
samplecdf <- function(n, cdf, spdf.lower = -Inf, spdf.upper = Inf) {
  my_fun <- match.fun(cdf)
  invcdf <- function(u) {
    subcdf <- function(t) my_fun(t) - u
    if (spdf.lower == -Inf)
      spdf.lower <- endsign(subcdf, -1)
    if (spdf.upper == Inf)
      spdf.upper <- endsign(subcdf)
    return(uniroot(subcdf, c(spdf.lower, spdf.upper))$root)
  }
  sapply(runif(n), invcdf)
}
This seems to work OK: when I compare the quantiles estimated from the randomly generated data, they are fairly close to the initial values. However, when I look at the histogram, something funny is happening at the tail, where it looks like my function is consistently generating more values than it should according to the pdf. The function does this across all my point estimates, and even though the individual quantiles seem close, I can tell that the overall Monte Carlo simulation is producing higher estimates for the 50th percentile than I expect. Here is a plot of my histogram of the random samples.
Any tips or advice would be very welcome. I think the best route would be to fit an exponential distribution to the CDF, but I'm struggling to do that. All "fitting" assumes that you have data that needs to be fitted -- this is more arbitrary than that.
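One way to sidestep root finding is to interpolate the quantile function directly, so that the inverse CDF is monotone by construction. A minimal sketch, assuming a hypothetical multiplier table (only the 0.1 and 0.9 rows come from the question; the other rows and the point estimate are invented for illustration) and using the monotone Hyman spline offered by splinefun:

```r
set.seed(42)
# Cumulative probabilities and multipliers; only 0.13 (p = 0.1) and
# 7.57 (p = 0.9) are from the question, the rest are made-up placeholders
p    <- c(0.001, 0.1, 0.5, 0.9, 0.999)
mult <- c(0.01, 0.13, 1.00, 7.57, 30)
point_estimate <- 100  # hypothetical value

# Monotone spline through (p, quantile): this plays the role of the inverse CDF
qfun <- splinefun(p, mult * point_estimate, method = "hyman")

# Inverse transform sampling; keep u inside the tabulated range
u <- runif(10000, min(p), max(p))
draws <- qfun(u)
quantile(draws, c(0.1, 0.5, 0.9))
```

Because the spline passes through the tabulated points, the sample quantiles should land near the table values; how the tails behave depends entirely on the outermost rows chosen for the table.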

Simulating a draw from the distribution of $X$ (in R)

I have a random variable $X$ with pdf $f(x)=4x^3$, and I need to simulate a draw from its distribution.
My solution consists of finding the cdf from the pdf (1st issue):
> pdf <- function(x){4*x^3}
> cdf <- integrate(pdf,lower=0,upper=x)
Error in integrate(pdf, lower = 0, upper = x) : object 'x' not found
Once I get the cdf $F$, I will set $X=F^{-1}(U)$ for a uniform $U$. I notice that the pdf follows a Beta distribution with $\alpha=4$ and $\beta=1$.
Is it best to find $F^{-1}$ via an inverse beta function? Is there a quick way to find the inverse of a beta function in R?
Since you have identified your pdf as beta, just use rbeta to sample.
s1 <- rbeta(5000,4,1)
In the case where the distribution is non-standard and you cannot solve analytically, you can use rejection sampling. Let's pretend we don't know your pdf is beta and we don't know how to integrate or invert it.
pdf <- function(x) 4*x^3 # on [0,1]
First we draw from our proposal distribution
p <- runif(50000)
Calculate the density values under our pdf
dp <- pdf(p)
And randomly accept/reject in proportion
s2 <- p[runif(50000) < dp/max(dp)]
You should find the distributions of s1 and s2 comparable, using histograms or, preferably, a qqplot.
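For this particular pdf the inverse CDF is also available in closed form: on [0, 1], $F(x)=x^4$, so $F^{-1}(u)=u^{1/4}$. A short sketch of inverse transform sampling, together with one way to write the numerical cdf as a function of its upper limit so that integrate no longer complains about a missing x (the names cdf_num and s3 are illustrative):

```r
set.seed(123)
pdf <- function(x) 4 * x^3  # on [0, 1]

# Numerical cdf: wrap integrate() in a function of the upper limit
cdf_num <- function(x) integrate(pdf, lower = 0, upper = x)$value
cdf_num(0.5)  # compare with the closed form 0.5^4

# Inverse transform sampling with F^{-1}(u) = u^(1/4)
u  <- runif(5000)
s3 <- u^(1/4)

# The same quantile function via the Beta(4, 1) distribution
all.equal(s3, qbeta(u, 4, 1))
```

Since qbeta is the inverse beta CDF, qbeta(runif(n), 4, 1) is another way to write the same sampler.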

Unable to plot PCA data in R. Are scores defined by a given object/name to plot them specifically?

I have completed a simple PCA function using code that was passed down through the institution. It outputs scores, loadings, eigenvalues, % eigenvalues, # of principal components, mean of columns, std deviation, and lastly the starting data. In the output file the scores are labeled with [[1]] before displaying the scores. I am attempting to plot these scores but I am unsure how to take that data from this point. I assumed it was assigned to this [[1]] or something in the code defined these scores. This line of code is presented below:
# perform pca on x
x.svd <- svd(x);
x.R <- x.svd$u %*% diag(x.svd$d);
x.C <- t(x.svd$v);
x.EV <- x.svd$d * x.svd$d
x.EVpct <- x.EV/sum(x.EV);
x.EV <- x.EV[1:sm];
x.EVpct <- x.EVpct[1:sm];
x.CumEVpct <- x.EVpct;
x.R is the part of the code enacting the scores but that too will not work with the plot function. Hopefully someone understands what I am struggling to ask. Any help is very appreciated. Thank you for your time.
The easiest thing to do would be:
pc <- prcomp(x)
plot(pc$x[, 1:2])
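The hand-rolled scores can be plotted the same way, as long as they are a plain numeric matrix. A minimal sketch, assuming x is a numeric matrix that is column-centered before svd (the data here are simulated stand-ins), showing that u %*% diag(d) matches the prcomp scores up to sign:

```r
set.seed(7)
x <- matrix(rnorm(200), ncol = 4)             # simulated stand-in data
xc <- scale(x, center = TRUE, scale = FALSE)  # column-center

x.svd <- svd(xc)
x.R <- x.svd$u %*% diag(x.svd$d)              # scores, as in the question

pc <- prcomp(x)                               # centers by default
all.equal(abs(x.R), abs(unname(pc$x)))        # same scores up to sign

plot(x.R[, 1:2], xlab = "PC1", ylab = "PC2")
```

If the function returns its results as a list, the scores printed under [[1]] would be extracted with double brackets on whatever object holds the result before plotting.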

R resample data to a log-normal distribution

I have a set of simulated data that are roughly uniformly distributed. I would like to sample a subset of these data and for that subset to have a log-normal distribution with a (log)mean and (log)standard deviation that I specify.
I can figure out some slow brute-force ways to do this, but I feel like there should be a way to do it in a couple lines using the plnorm function and the sample function with the "prob" variable set. I can't seem to get the behavior I'm looking for though. My first attempt was something like:
probs <- plnorm(orig_data, meanlog = mu, sdlog = sigma)
new_data <- sample(orig_data, replace = FALSE, prob = probs)
I think I'm misinterpreting the way the plnorm function behaves. Thanks in advance.
If your orig_data are uniformly distributed between 0 and 1, then
new_data = qlnorm(orig_data, meanlog = mu, sdlog = sigma)
will give log-normally distributed data. If your data aren't between 0 and 1 but between, say, a and b, then first rescale them:
orig_data = (orig_data-a)/(b-a)
Generally speaking, a uniform random variable between 0 and 1 can be treated as a probability, so to sample from a given distribution with it you have to use the corresponding q... function, i.e. take the corresponding quantile.
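The rescale-then-quantile recipe can be sketched as follows. This is only an illustration with simulated uniform data and arbitrary values for mu, sigma, a and b; note that it transforms every value rather than selecting a subset:

```r
set.seed(2024)
mu <- 1; sigma <- 0.5        # illustrative log-mean and log-sd
a <- 10; b <- 50             # assumed known bounds of the uniform data
orig_data <- runif(5000, a, b)

u <- (orig_data - a) / (b - a)                      # rescale to (0, 1)
new_data <- qlnorm(u, meanlog = mu, sdlog = sigma)  # take the quantile

c(mean(log(new_data)), sd(log(new_data)))  # should be near mu and sigma
```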
Thanks guys for the suggestions. While they get me close, I've decided on a slightly different approach for my particular problem, which I'm posting as the solution in case it's useful to others.
One specific I left out of the original question is that I have a whole data set (stored as a data frame), and I want to resample rows from that set such that one of the variables (columns) is log-normally distributed. Here is the function I wrote to accomplish this, which relies on dlnorm to calculate probabilities and sample to resample the data frame:
resample_lognorm <- function(origdataframe, origvals, meanlog, sdlog, n) {
  prob <- dlnorm(origvals, meanlog = log(10) * meanlog, sdlog = log(10) * sdlog)
  newsamp <- origdataframe[sample(nrow(origdataframe),
                                  size = n, replace = FALSE, prob = prob), ]
  return(newsamp)
}
In this case origdataframe is the full data frame I want to sample from, and origvals is the column of data I want to resample to a log-normal distribution. Note that the log(10) factors in meanlog and sdlog are there because I want the distribution to be log-normal in base 10, not natural log.

How to get Inverse CDF (kernel) in R?

Is there any function in R which will calculate the inverse CDF of a kernel density estimate (I am considering a normal kernel) for a particular alpha in (0,1)?
I have found quantile but I am not sure how it works.
Thanks
We can integrate to get the cdf and we can use a root finding algorithm to invert the cdf. First we'll want to interpolate the output from density.
set.seed(10000)
x <- rnorm(1000, 10, 13)
pdf <- density(x)
# Interpolate the density
f <- approxfun(pdf$x, pdf$y, yleft=0, yright=0)
# Get the cdf by numeric integration
cdf <- function(x){
  integrate(f, -Inf, x)$value
}
# Use a root finding function to invert the cdf
invcdf <- function(q){
  uniroot(function(x){cdf(x) - q}, range(x))$root
}
which gives
med <- invcdf(.5)
cdf(med)
#[1] 0.5000007
This could definitely be improved upon. One issue is that I don't guarantee that the cdf is always less than or equal to 1 (if you check the cdf for values larger than max(x) you might get something like 1.00097). But I'm too tired to fix that now. This should give a decent start.
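One possible way to address the cdf exceeding 1 is to accumulate the density on its own grid, rescale so that the total is exactly 1, and invert by swapping the axes of the interpolation. A rough sketch under those assumptions (the names cdf2 and invcdf2 are made up here; this replaces integrate and uniroot with a trapezoid rule and linear interpolation, so it is an approximation rather than the method above):

```r
set.seed(10000)
x <- rnorm(1000, 10, 13)
d <- density(x, n = 2048)

# Cumulative trapezoid rule on the density grid, rescaled to end at 1
dx  <- diff(d$x)
cum <- c(0, cumsum(dx * (d$y[-1] + d$y[-length(d$y)]) / 2))
cum <- cum / max(cum)

cdf2 <- approxfun(d$x, cum, yleft = 0, yright = 1)
# Inverse by swapping the axes; ties = "ordered" tolerates flat stretches
invcdf2 <- approxfun(cum, d$x, ties = "ordered")

invcdf2(c(0.025, 0.5, 0.975))
draws <- invcdf2(runif(10000))  # random deviates from the density estimate
```

Because invcdf2 is vectorized, turning a vector of uniforms into deviates is a single call, which also covers the original goal of sampling from a density object.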
An alternative approach would be to use log-spline density estimation rather than kernel density estimation. Look at the 'logspline' package. With logspline density estimations you get CDF (plogspline) and inverse CDF (qlogspline) functions.
