How to generate a probability density function and expectation in r? - r

The task:
Eric the fly has a friend, Ernie. Assume that the two flies sit at independent locations, uniformly distributed on the globe’s surface. Let D denote the Euclidean distance between Eric and Ernie (i.e., on a straight line through the interior of the globe).
Make a conjecture about the probability density function of D and give an
estimate of its expected value, E(D).
So far I have made a function to generate two points on the globe's surface, but I am unsure what to do next:
sample3d <- function(2)
{
df <- data.frame()
while(n > 0){
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1){
u <- sqrt(x^2+y^2+z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
E <- sample3d(2)

This is an interesting problem. I'll outline a computational approach; I'll leave the math up to you.
First we fix a random seed for reproducibility.
set.seed(2018);
We sample 10^4 points from the unit sphere surface.
sample3d <- function(n = 100) {
df <- data.frame();
while(n > 0) {
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1) {
u <- sqrt(x^2 + y^2 + z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
df <- sample3d(10^4);
Note that sample3d is not very efficient, but that's a different issue.
We now randomly sample 2 points from df, calculate the Euclidean distance between those two points (using dist), and repeat this procedure N = 10^4 times.
# Sample 2 points randomly from df, repeat N times
N <- 10^4;
dist <- replicate(N, dist(df[sample(1:nrow(df), 2), ]));
As pointed out by #JosephWood, the number N = 10^4 is somewhat arbitrary. We are using a bootstrap to derive the empirical distribution. For N -> infinity one can show that the empirical bootstrap distribution is the same as the (unknown) population distribution (Bootstrap theorem). The error term between empirical and population distribution is of the order 1/sqrt(N), so N = 10^4 should lead to an error around 1%.
We can plot the resulting probability distribution as a histogram:
# Let's plot the distribution
ggplot(data.frame(x = dist), aes(x)) + geom_histogram(bins = 50);
Finally, we can get empirical estimates for the mean and median.
# Mean
mean(dist);
#[1] 1.333021
# Median
median(dist);
#[1] 1.41602
These values are close to the theoretical values:
mean.th = 4/3
median.th = sqrt(2)

Related

Problem with theoretical density functions plotted to histograms in R

I did simulations in R and plotted the results in histograms. There is no problem adding probability density into the histograms with the density() function. But for some reason I get very strange results when I plot the theoretical densities to the histograms for comparison purposes. Here are two example codes and two pictures. The blue theoretical pdfs are way off and I do not know why. Could someone with R skills point out my mistakes?
# generating 10000 samples thrice from U(0,1) distribution
# and sorting them for the statistics:
n <- 10000
samples1 <- data.frame('0'=c(rep(NA,4)))
samples2 <- data.frame('0'=c(rep(NA,10)))
samples3 <- data.frame('0'=c(rep(NA,10)))
for (i in 1:n) {
new <- runif(4)
samples1[ , ncol(samples1) + 1] <- sort(new)
colnames(samples1)[ncol(samples1)] <- i
new <- runif(10)
samples2[ , ncol(samples2)+1] <- sort(new)
colnames(samples2)[ncol(samples2)] <- i
new <- runif(10)
samples3[ , ncol(samples3)+1] <- sort(new)
colnames(samples3)[ncol(samples3)] <- i
}
# dropping the first (useless) columns:
samples1 <- samples1[-c(1)]
samples2 <- samples2[-c(1)]
samples3 <- samples3[-c(1)]
# selecting the statistics from the samples:
# X_2:4
stat24 <- rep(NA,n)
for (i in 1:n) {
stat24[i] <- samples1[2,i]
}
# X_2:10
stat210 <- rep(NA,n)
for (i in 1:n) {
stat210[i] <- samples1[2,i]
}
# X_10:10
stat1010 <- rep(NA,n)
for (i in 1:n) {
stat1010[i] <- samples1[10,i]
}
# plotting the histograms and Beta pdfs:
hist(stat24, freq = FALSE)
lines(dbeta(stat24, 2, 5), col='blue')
lines(density(stat24), col='red')
Distribution of the kth statistic follows Beta(k, n+k-1) distribution which appears as the odd blue stroke on the right.
n <- 10000
random_variable_F <- rep(NA,n)
# generating 10000 samples of sizes 10 and 5 and computing F:
for (i in 1:n) {
x <- rnorm(10, mean = 10, sd = sqrt(5))
y <- rnorm(5, mean = 20, sd = sqrt(10))
random_variable_F[i] <- ((var(x))*5)/((var(y)*10))
}
#head(random_variable_F)
# plotting the histogram:
hist(random_variable_F, freq = F)
lines(density(random_variable_F), col='red')
lines(df(random_variable_F, 9, 4,), col='blue')
Random variable F follows F-distribution. Paramaters are the sample sizes minus one, in this case 10-1=9 and 5-1=4. the theoretical curve is quite wild:
If you pass a single vector to lines, it assumes that this is a vector of y values you want to plot. It plots the first y value at x = 1, the second y value at x = 2, etc, all the way up to x = length(y). In your case, random_variable_F is an unordered random variable, and you are just plotting its sequential values at 1:10000 along the x axis.
Clearly, you want the function y = df(x) to be plotted, so you need to pass random_variable_F as the x values and df(random_variable_F) as the y values. You will also need to sort random_variable_F first to ensure the line is plotted from left to right:
hist(random_variable_F, freq = F)
lines(density(random_variable_F), col='red')
lines(sort(random_variable_F), df(sort(random_variable_F), 9, 4), col='blue')
Note that this doesn't happen when you plot lines(density(random_variable_F)) because density produces a list containing ordered x and y valued rather than a vector.

Generating 3D data with cube as a decision surface

I am new to using r program. I have a task to use r to create a function to simulate standard normal distribution containing 500 observations and three variables, x,y,& z.
I am to use cube as a decision surface to categorize observations based on whether they fell within or outside the cube.
Below is my code. I am able to plot the 3D data, but I am not sure of how to categorize the datasets into two classes.
library(scatterplot3d)
set.seed (1234)
nObs <- 500
x <- matrix (rnorm (1.25*nObs), ncol =2)
y <- matrix (rnorm (1.25*nObs), ncol =2)
z <- matrix (rnorm (1.25*nObs), ncol =2)
mSample <- function(nObs,x,y,z){
x1 <- rnorm(1,x)
x1[y==1,] <- x[y==1,] + 1
mSample <- as_tibble(rbind(mvnfast::rmvn(x,y = y1,z = z1), mvnfast::rmvn(x,y = y1,z = z1)))
mSample[1:x1, 1.25] <- 0
mSample[(x1 + 1):(x1 + 1), 1.25] <- 1
mSample <- mSample[sample(nrow(mSample)), ]
colnames(mSample <- c("x", "y", "class"))
mSample
}
spl <- scatterplot3d(x,y,z)
spl <- scatterplot3d(x,y,z,pch=16,highlight.3d=TRUE)
I had a similar question to this recently. Basically, to know if a given point is inside or outside of a cube, first you need to know the length of the cube.
Then, simply iterate over all the points (nObs) and do an if statement
if (x > -cubeLength ** x < cubeLength && y > -cubeLength ** y < cubeLength && z > -cubeLength ** z < cubeLength) {
classify positive
}
else {
classify negative
}

Ratio-of-Uniforms Distribution in R

I have an exercise, in which i have to create an algorithm as follows:
ratio of Uniforms is based on the fact that for a random variable X with density f(x) we can generate X from the desired density by calculating X = U/V for a pair (U, V ) uniformly distributed in the set
Af = {(u,v):0 < v ≤ f(u/v)}
Random points can be sampled uniformly in Af by rejection from the min- imal bounding rectangle, i.e., the smallest possible rectangle that contains Af .
It is given by (u−, u+) × (0, v+) where
v+ = max f(x), x
u− = minx f(x), x
u+ = maxx f(x)
Then the Ratio-of-Uniforms method consists of the following simple steps:
Generate random number U uniformly in (u−, u+).
Generate random number V uniformly in (0, v+).
Set X ← U/V .
If V 2 ≤ f(X) accept and return X.
Else try again.
My code so far:
x <- cnorm(1, mean = 0, sd=1)
myrnorm <- function(pdf){
## call rou() n times
pdf <- function(x) {exp(-x^2/2)}
}
rou <- function(u, v) {
uplus <- 1
vplus <- 1
n <- 100
u <- runif(n, min=0, max=uplus)
v <- runif(n, min=0, max=vplus)
xi <- v/u
while(v < sqrt(xi)) {
if(v^2 <= xi)
return(xi)
}
}
myx <- myrnorm(1000)
hist(myx)
But I really dont know how to go on. Im ´lost with this exercise. I would be really grateful for any advise.
Following example 1 in page 8 of this link and your sample code, I came up this solution:
ratioU <- function(nvals)
{
h_x = function(x) exp(-x)
# u- is b-, u+ is b+ and v+ is a in the example:
uminus = 0
uplus = 2/exp(1)
vplus = 1
X.vals <- NULL
i <- 0
repeat {
i <- i+1
u <- runif(1,0,vplus)
v <- runif(1,uminus,uplus)
X <- u/v
if(v^2 <= h_x(X)) {
tmp <- X
}
else {
next
}
X.vals <- c(X.vals,tmp)
if(length(X.vals) >= nvals) break
}
answer <- X.vals
answer
}
sol = ratioU(1000)
par(mfrow=c(1,2))
hist(sol,breaks=50, main= "using ratioU",freq=F)
hist(rexp(1000),breaks = 50, main="using rexp from R",freq=F)
par(mfrow=c(1,1))
par(mfrow=c(1,2))
plot(density(sol))
plot(density(rexp(1000)))
par(mfrow=c(1,1))
A lot of the code may be optimized but I think it is good enough like this for this purpose. I hope this helps.

Simulate array with scatter & known relation to X

This is a very basic R question, but I can't seem to find the right packages to do what I want.
I have an array 'X', with n values. I want to simulate an array, 'Y', that follows a known relation Y = alpha + beta*X. Furthermore, I want to add intrinsic scatter to the Y array. Alpha, beta, and the intrinsic scatter should be input values by the user.
Can someone help me with how I would go about doing this?
Thanks!
Do you mean like this?
> x <- 1:5
> alpha <- 2
> beta <- 3
> y <- alpha + beta * x
> y
[1] 5 8 11 14 17
And by "scatter" do you mean random noise? You can simulate that by added random values like so (I am using a normal distribution) :
> y <- alpha + beta * x + rnorm(5)
> y
[1] 4.710538 7.700785 10.588489 14.252223 16.108079
Here is a function that creates the deterministic part of the correlation and then adds noise via rnorm
make_correlation <- function(alpha, beta, scatter, x){
# make deterministic part
y_det <- alpha + beta*x
# add noise
y <- rnorm(length(x), y_det, scatter)
return(y)
}
set.seed(20)
x <- runif(20, 0, 10)
answer <- make_correlation(alpha = 2, beta = 3, scatter = 2, x)
plot(answer~x)

Adding two random variables via convolution in R

I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside the real world example I am working with requires addition to two random variables that are distributed according to a number of different distributions.
Does anyone know how to add two random variables by convoluting the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
EDIT
Thank you for these answers. I define a pdf, and try to do the convolution integral, but R complains on the integration step. My pdfs are Log Pearson 3 and are as follows
dlp3 <- function(x, a, b, g) {
p1 <- 1/(x*abs(b) * gamma(a))
p2 <- ((log(x)-g)/b)^(a-1)
p3 <- exp(-1* (log(x)-g) / b)
d <- p1 * p2 * p3
return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step since the f.t function is bounded and my integration limits are probably not correct. Any ideas on how to solve this?
Here is one way.
f.X <- function(x) dnorm(x,1,0.5) # normal (mu=1.5, sigma=0.5)
f.Y <- function(y) dlnorm(y,1.5, 0.75) # log-normal (mu=1.5, sigma=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # distribution function
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to #jlhoward's using Riemann approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
#Riemann approximation of convolution
conv <- function(t, a, b, d) { #a to b needs to cover the range of densities above. d needs to be small for accurate approx.
z <- NA
x <- seq(a, b, d)
for (i in 1:length(t)){
print(i)
z[i] <- sum(d1(x)*d2(t[i]-x)*d)
}
return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")

Resources