Central Limit Theorem in R

I wish to simulate the central limit theorem in order to demonstrate it, and I am not sure how to do it in R. I want to create 10,000 samples with a sample size of n (can be numeric or a parameter), from a distribution I will choose (uniform, exponential, etc.). Then I want to graph in one plot (using the par and mfrow commands) the original distribution (histogram), the distribution of the means of all samples, a Q-Q plot of the means, and in the 4th graph (there are four, 2x2), I am not sure what to plot. Can you please assist me in starting to program it in R? I think once I have the simulated data I should be fine. Thank you.
My initial attempt is below; it is too simple and I am not sure it is even correct.
r <- 10000
n <- 20
M <- matrix(0, n, r)
Xbar <- rep(0, r)
for (i in 1:r) {
  M[, i] <- runif(n, 0, 1)  # each column holds one sample of size n
  Xbar[i] <- mean(M[, i])   # sample mean of column i
}
hist(Xbar)

The CLT states that given i.i.d. samples from a distribution with finite mean and variance, the sample mean (as a random variable) has a distribution that converges to a Gaussian as the number of samples n increases. Here, I will assume that you want to generate r sample sets containing n samples each, to create r samples of the sample mean. Some code to do that is as follows:
set.seed(123) ## set the seed for reproducibility
r <- 10000
n <- 200 ## I use 200 instead of 20 to enhance convergence to Gaussian
## this function computes the r samples of the sample mean from the
## r*n original samples
sample.means <- function(samps, r, n) {
  rowMeans(matrix(samps, nrow = r, ncol = n))
}
For generating the plots, we use ggplot2 and Aaron's qqplot.data function from here. We also use gridExtra to plot multiple plots in one frame.
library(ggplot2)
library(gridExtra)
qqplot.data <- function(vec) {
  # following four lines from base R's qqline()
  y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y) / diff(x)
  int <- y[1L] - slope * x[1L]
  d <- data.frame(resids = vec)
  ggplot(d, aes(sample = resids)) +
    stat_qq() +
    geom_abline(slope = slope, intercept = int, colour = "red") +
    ggtitle("Q-Q plot")
}
generate.plots <- function(samps, samp.means) {
  p1 <- qplot(samps, geom = "histogram", bins = 30, main = "Sample Histogram")
  p2 <- qplot(samp.means, geom = "histogram", bins = 30, main = "Sample Mean Histogram")
  p3 <- qqplot.data(samp.means)
  grid.arrange(p1, p2, p3, ncol = 2)
}
Then we can use these functions with the uniform distribution:
samps <- runif(r*n) ## uniform distribution [0,1]
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get: [resulting plots omitted]
Or, with the Poisson distribution with mean = 3:
samps <- rpois(r*n,lambda=3)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get: [resulting plots omitted]
Or, with the exponential distribution with rate = 1 (mean = 1):
samps <- rexp(r*n,rate=1)
# compute sample means
samp.means <- sample.means(samps, r, n)
# generate plots
generate.plots(samps, samp.means)
We get: [resulting plots omitted]
Note that the sample-mean histograms all look like Gaussians centered very close to the mean of the original generating distribution, whether that is uniform, Poisson, or exponential, as predicted by the CLT (and their variance is 1/n = 1/200 times the variance of the original generating distribution).
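A quick numerical check of the variance claim, run right after any of the examples above:
var(samps) / n    # variance of the generating distribution divided by n
var(samp.means)   # should be close to the line above
mean(samps)       # mean of the generating distribution ...
mean(samp.means)  # ... is matched closely by the mean of the sample means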

Maybe this can help you get started. I have hard-coded the normal distribution and only shown two of your suggested plots: the histogram of a randomly selected sample, and a histogram of all sample means.
I guess my main suggestion is using a list to store the samples instead of a matrix.
r <- 10000
my.n <- 20
simulation <- list()
for (i in 1:r) {
  simulation[[i]] <- rnorm(my.n)
}
sample.means <- sapply(simulation, mean)
selected.sample <- sample(r, 1)  # a random integer index (runif() would give a non-integer)
par(mfrow = c(1, 2))
hist(simulation[[selected.sample]])
hist(sample.means)
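As a side note, the same list can be built without an explicit loop using replicate():
simulation <- replicate(r, rnorm(my.n), simplify = FALSE)
sample.means <- sapply(simulation, mean)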

Related

Simulate in R the number of samples needed in order to achieve the true standard deviation

I want to recreate in R the figure above, which simulates the number of samples needed in order to achieve the true standard deviation. How can I do it in R?
I suppose that the distribution is a t-distribution or a normal, so I have to generate numbers from these distributions, each time increasing the size of the sample, and plot the results in order to recreate the figure.
Any help?
set.seed(123)
x <- list(v1 = rnorm(1, 0, 12),   v2 = rnorm(10, 0, 11),
          v3 = rnorm(20, 0, 10),  v4 = rnorm(30, 0, 9),
          v5 = rnorm(40, 0, 8),   v6 = rnorm(50, 0, 7),
          v7 = rnorm(60, 0, 6),   v8 = rnorm(70, 0, 5),
          v9 = rnorm(80, 0, 4),   v10 = rnorm(90, 0, 3),
          v11 = rnorm(100, 0, 2), v12 = rnorm(110, 0, 2))
g <- lapply(x, sd)
g
g1 <- unlist(g)
plot(g1, type = "l")
First, start with a random uniform distribution of suitable size, and select which sample sizes you want to compute your standard error of the mean.
set.seed(123)
x <- runif(1e6, 0, 1)
sample_size <- 5:120
You can define a function to compute this standard error of the mean, sigma_m. Here you draw, with replacement, a sample of size n from x, take its standard deviation, and divide by sqrt(n).
calc_sigma_m <- function(n, x) {
  sd(sample(x, n, replace = TRUE)) / sqrt(n)
}
A data frame can neatly store the sample sizes and sigma_m values for plotting:
df <- data.frame(sample_size,
                 sigma_m = sapply(sample_size, calc_sigma_m, x))
Your initial plot will look like this:
library(ggplot2)
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line()
As expected, this is not smooth especially at smaller sample sizes.
If you want a smooth curve for demonstration, repeat the sampling process and the sigma_m calculation many times, and take the mean.
calc_sigma_m_mean <- function(n, x) {
  mean(replicate(1000, sd(sample(x, n, replace = TRUE)) / sqrt(n)))
}
df <- data.frame(sample_size, sigma_m = sapply(sample_size, calc_sigma_m_mean, x))
Then you will get a smoother curve:
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line()
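As a sanity check, you can overlay the theoretical standard error: a U(0,1) variable has standard deviation 1/sqrt(12), so sigma_m = (1/sqrt(12))/sqrt(n). A sketch adding it in red:
ggplot(df, aes(sample_size, sigma_m)) +
  geom_line() +
  geom_line(aes(y = (1 / sqrt(12)) / sqrt(sample_size)),
            colour = "red", linetype = "dashed")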

How to run a Monte Carlo simulation from a custom distribution in R

I would like to pull 1000 samples from a custom distribution in R.
I have the following custom distribution:
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code which produced a histogram shape I did not expect.
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, has values that increase with the index up to a threshold around 400, level off, and then decrease. That is what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of the values stored in points is different: a histogram shows how often values fall within a certain range. You'll notice that the x axis of your histogram is similar to the y axis of the plot(points ~ x) plot. The actual distribution of the values in the points object is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement, from an object with about 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the tail at the top, which you may not be expecting if you want the histogram to look like the plot of the values vs. the index (or some increasing x). It means that there are more values in points around 2 than around 1. See how the curve of plot(points ~ x) flattens when the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: most of the values (whether they're at the low end or the high end of that curve) are close to 0, or at least less than 0.25. If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20) {
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
}
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for (i in 2:N) {
    pr <- out[i - 1] + rnorm(1, ...)                 # random-walk proposal
    r <- fun(pr) / fun(out[i - 1])                   # acceptance ratio
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])   # accept or keep current
  }
  out
}
It starts from the point init and gives N draws. The approach can be improved in many ways, but I'm simply going to start from init = 5, include a burn-in period of 20,000 draws, and keep every second draw to reduce the number of repeated values:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
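To see how well the draws match the target, one can overlay the target density, normalized numerically (a sketch):
konst <- integrate(fun, 0, Inf)$value            # numerical normalizing constant
curve(fun(x) / konst, from = 0, to = 20, add = TRUE, col = "red")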
You invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with(environment(ecd.points), approxfun(y, x))
samp.inv.ecd <- function(n = 100) invecdfpts(runif(n))
plot(density(samp.inv.ecd(100)))
plot(density(points))
png(); layout(matrix(1:2, 1))
plot(density(samp.inv.ecd(100)), main = "The Sample")
plot(density(points), main = "The Original"); dev.off()
Here's another way to do it that draws from R: Generate data from a probability density distribution and How to create a distribution function in R?:
x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)
f <- function (x) (20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1))
C <- integrate(f, -Inf, Inf)
C$value
# [1] 11.50361
# normalize by C$value
f <- function(x) {
  20 * dexGAUS(x, mu = 1, sigma = 2, nu = 3) * pgamma(x, shape = 3, rate = 1) / 11.50361
}
pdf <- data.frame(x = x, y = f(x))  # evaluate the normalized density on the x grid
random.points <- approx(cumsum(pdf$y) / sum(pdf$y), pdf$x, runif(10000))$y
hist(random.points, 1000)
hist(random.points * 40, 1000) will recover scaling similar to that of your original function.

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom <- function(x, sd, L, R) {
  R. <- (log(R) - log(x)) / sd
  L. <- (log(L) - log(x)) / sd
  ll <- pnorm(R.) - pnorm(L.)
  return(ll)
}
df <- data.frame(Range = seq(100, 500), sd = rep(0.1, 401), L = 200, U = 400)
df <- mutate(df, Likelihood = dcustom(Range, sd, L, U))
with(df, plot(Range, Likelihood, type = "l"))
abline(v = 200)
abline(v = 400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (a derived parameter) is between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (whose sharpness depends on the sd), and 0 if it is far outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
1. transform the likelihood function into a probability distribution
2. calculate the cumulative distribution function
3. apply inverse transform sampling
Is this correct, and if so, how do I code this? Thank you.
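Those three steps can be coded directly with a grid-based numerical inversion; here is a sketch (the grid range and resolution are assumptions):
xs <- seq(100, 500, length.out = 2001)           # grid over the range of interest
dens <- dcustom(xs, sd = 0.1, L = 200, R = 400)  # step 1: unnormalized density
cdf <- cumsum(dens) / sum(dens)                  # step 2: discrete CDF, normalized to 1
rcustom <- function(n) approx(cdf, xs, runif(n), ties = "ordered")$y  # step 3: inverse transform
hist(rcustom(10000), breaks = 50, freq = FALSE)  # should follow the plotted curve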
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution, given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample) {
  lik <- dcustom(x = x, sd = 0.1, L = 200, R = 400)
  # propose a value for x (you can adjust the step size with the sd)
  x.proposed <- x + rnorm(1, 0, sd = 20)
  lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
  r <- lik.proposed / lik # this is the acceptance ratio
  # accept the new value with probability r; the chain keeps the
  # current value on rejection, as Metropolis-Hastings requires
  if (runif(1) < r) {
    x <- x.proposed
  }
  posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302

Creating a histogram from iterations of a binomial distribution in R

Here are the instructions:
Create 10,000 iterations (N = 10,000) of rbinom(50, 1, 0.5) with n = 50 and your guess of p0 = 0.50 (hint: you will need to construct a for loop). Plot a histogram of the results of the sample. Then plot your pstar on the histogram. If pstar is not in the extreme region of the histogram, you would assume your guess is correct and vice versa. Finally, calculate the probability that p0 < pstar (this is a p value).
I know how to create the for loop and the rbinom function, but am unsure on how transfer this information to plotting on a histogram, in addition to plotting a custom point (my guess value).
I'm not doing your homework for you, but this should get you started. You don't say what pstar is supposed to be, so I am assuming you are interested in the (distribution of the) maximum likelihood estimates for p.
You create 10,000 N=50 binomial samples (there is no need for a for loop):
samples <- lapply(seq(10^4), function(x) rbinom(50, 1, 0.5))
The ML estimates for p are then
phat <- sapply(samples, function(x) sum(x == 1) / length(x))
Inspect the distribution
library(ggplot2)
ggplot(data.frame(phat = phat), aes(phat)) + geom_histogram(bins = 30)
and calculate the probability that p0 < phat.
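For instance, the guess p0 = 0.50 can be marked on the histogram and the tail probability computed empirically (a sketch using phat from above):
ggplot(data.frame(phat = phat), aes(phat)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 0.5, colour = "red")  # the guess p0
mean(0.5 < phat)  # empirical probability that p0 < phat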
Edit 1
If you insist, you can also use a for loop to generate your samples.
samples <- list()
for (i in 1:10^4) {
  samples[[i]] <- rbinom(50, 1, 0.5)
}

Generating samples from a two-Gaussian mixture in R (code given in MATLAB)

I'm trying to create (in R) the equivalent of the following MATLAB function, which will generate n samples from a mixture of N(m1, (s1)^2) and N(m2, (s2)^2) with a fraction, alpha, from the first Gaussian.
I have a start, but the results are notably different between MATLAB and R (i.e., the MATLAB results give occasional values of +-8 but the R version never even gives a value of +-5). Please help me sort out what is wrong here. Thanks :-)
For Example:
Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.
MATLAB
function
function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha)
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);
implementation
P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])
[resulting plot and histogram omitted]
R
yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))
[resulting plot and histogram omitted]
As always, THANK YOU!
SOLUTION
gaussmix <- function(nsim, mean_1, mean_2, std_1, std_2, alpha) {
  U <- runif(nsim)
  I <- as.numeric(U < alpha)
  y <- I * rnorm(nsim, mean = mean_1, sd = std_1) +
    (1 - I) * rnorm(nsim, mean = mean_2, sd = std_2)
  return(y)
}
z1 <- gaussmix(1000, 0, 0, 1, 6, 0.95)
z1_standardized <- (z1 - mean(z1)) / sqrt(var(z1))
z2 <- gaussmix(1000, 0, 3, 1, 1, 0.80)
z2_standardized <- (z2 - mean(z2)) / sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3 - mean(z3)) / sqrt(var(z3))
par(mfrow = c(2, 3))
hist(z1_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of 95% of N(0,1) and 5% of N(0,36)",
     col = "blue", xlab = " ")
hist(z2_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of 80% of N(0,1) and 20% of N(3,1)",
     col = "blue", xlab = " ")
hist(z3_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of samples of LN(0,1)", col = "blue", xlab = " ")
##
plot(z1_standardized, type = "l",
     main = "1000 samples from a mixture N(0,1) and N(0,36)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
plot(z2_standardized, type = "l",
     main = "1000 samples from a mixture N(0,1) and N(3,1)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
plot(z3_standardized, type = "l",
     main = "1000 samples from LN(0,1)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
There are two problems, I think: (1) your R code is creating a mixture of normal distributions with standard deviations of 1 and 37; (2) by setting prob equal to alpha in your rbinom() call, you're getting a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of a Gaussian with sd 1, rather than a Gaussian with sd 1 that is contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (which is about 36) basically reduces it to a standard Gaussian with a slight bump near the origin.
(The other answers posted here do solve your problem perfectly well, but I thought you might be interested in a diagnosis ...)
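For completeness, a minimal corrected version of the original rbinom() approach (a sketch keeping the question's variable names):
yn <- rbinom(1000, 1, 0.95)                  # 1 with probability alpha = 0.95
s <- rnorm(1000, 0, ifelse(yn == 1, 1, 6))   # sd 1 for the first Gaussian, sd 6 for the second
sn <- (s - mean(s)) / sd(s)
hist(sn, xlim = range(-15, 15))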
A more compact (and perhaps more idiomatic) version of your MATLAB gaussmix function (I think runif(n) < alpha is slightly more efficient than rbinom(n, size = 1, prob = alpha)):
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
  I <- runif(n) < alpha
  rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and diverse covariance structures. See ?mclust::sim. The example task would be done this way:
require(mclust)
simdata = sim(modelName = "V",
              parameters = list(pro = c(0.95, 0.05),
                                mean = c(0, 0),
                                variance = list(modelName = "V",
                                                d = 1,
                                                G = 2,
                                                sigmasq = c(1, 36))),
              n = 1000)
plot(scale(simdata[,2]), type = "h")
I recently wrote density and sampling functions for a multinomial mixture of normal distributions:
dmultiNorm <- function(x, means, sds, weights) {
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  N <- length(x)
  n <- length(means)
  if (missing(weights)) {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  weights <- weights / sum(weights)
  dens <- numeric(N)
  for (i in 1:n) {
    dens <- dens + weights[i] * dnorm(x, means[i], sds[i])
  }
  return(dens)
}

rmultiNorm <- function(N, means, sds, weights, scale = TRUE) {
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  n <- length(means)
  if (missing(weights)) {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  Res <- numeric(N)
  for (i in 1:N) {
    s <- sample(1:n, 1, prob = weights)   # pick a component
    Res[i] <- rnorm(1, means[s], sds[s])  # draw from it
  }
  return(Res)
}
Here means is a vector of means, sds a vector of standard deviations, and weights a vector of (proportional) probabilities for sampling from each of the distributions. Is this useful to you?
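For the question's example, usage might look like this (a sketch; note sd = 6 if N(0,36) denotes a variance of 36):
z <- rmultiNorm(1000, means = c(0, 0), sds = c(1, 6), weights = c(95, 5))
hist((z - mean(z)) / sd(z), breaks = 50)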
Here is code to do this task:
"For Example: Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one."
# note: rnorm() takes a standard deviation, so if N(0,36) denotes variance 36,
# the second component should arguably be rnorm(50, 0, 6) rather than 36
plot(multG <- c(rnorm(950), rnorm(50, 0, 36))[sample(1000)], type = "h")
scmulG <- scale(multG)
summary(scmulG)
#-----------
#       V1
# Min.   :-9.01845
# 1st Qu.:-0.06544
# Median : 0.03841
# Mean   : 0.00000
# 3rd Qu.: 0.13940
# Max.   :12.33107
