Simulate array with scatter & known relation to X - r

This is a very basic R question, but I can't seem to find the right packages to do what I want.
I have an array 'X', with n values. I want to simulate an array, 'Y', that follows a known relation Y = alpha + beta*X. Furthermore, I want to add intrinsic scatter to the Y array. Alpha, beta, and the intrinsic scatter should be input values by the user.
Can someone help me with how I would go about doing this?
Thanks!

Do you mean like this?
> x <- 1:5
> alpha <- 2
> beta <- 3
> y <- alpha + beta * x
> y
[1] 5 8 11 14 17
And by "scatter" do you mean random noise? You can simulate that by added random values like so (I am using a normal distribution) :
> y <- alpha + beta * x + rnorm(5)
> y
[1] 4.710538 7.700785 10.588489 14.252223 16.108079

Here is a function that creates the deterministic part of the correlation and then adds noise via rnorm
make_correlation <- function(alpha, beta, scatter, x){
# make deterministic part
y_det <- alpha + beta*x
# add noise
y <- rnorm(length(x), y_det, scatter)
return(y)
}
set.seed(20)
x <- runif(20, 0, 10)
answer <- make_correlation(alpha = 2, beta = 3, scatter = 2, x)
plot(answer~x)

Related

How to find all the combinations of variables (between 2 threshold values) that make an equation hold true?

Given the equation: 10 = 2x + 3y + 2z, I want to find all the combination of x, y and z between 2 thresholds (e.g. -100 and 100) that make the equation hold true.
So far, I have tried the following with no success (please see the comments):
set.seed(333)
result = 10
# sample random values from uniform distribution
x = runif(100, -100, 100)
y = runif(100, -100, 100)
z = runif(100, -100, 100)
# store the vectors in a single dataframe
df = data.frame(x, y, z)
# find all the combinations of x, y and z
expanded = expand.grid(df)
# calculation with all the combinations
expanded$result = ((expanded$x * 2) + (expanded$y * 3) + (expanded$z * 2)) == result
# show the data where result is 10
expanded[expanded$result, ]
[1] x y z result
<0 rows> (or 0-length row.names)
How would I achieve this?
A linear equation of three variables can be thought of as a plane in 3D space. Solving your equation for z we have:
z = 5 - x - 1.5 * y
So we can at least see what the solution looks like. Let's examine all the valid integer values of x and y:
df <- expand.grid(x = seq(-100, 100), y = seq(-100, 100))
We can then get the corresponding values of z:
df$z <- 5 - df$x - 1.5 * df$y
But we need to disallow any values of z below -100 or above 100:
df$z[abs(df$z) > 100] <- NA
And we can plot the resultant plane with the z value shown as a fill color:
ggplot(df, aes(x, y, fill = z)) +
geom_raster() +
scale_fill_viridis_c() +
coord_equal()
These at a glance, are all 25,184 integer solutions of your equation with the given constraints (as others have pointed out, there are of course an infinite number of non-integer solutions), and it's a reasonable approximation of the whole system. You can see all these combinations with
df[!is.na(df$z),]
As multiple commenters have noted, your equation 10 = 2x + 3y + 2z forms a plane of possible values.
Consider this:
library(plotly)
myfunc <- \(x,y)5-x-(3/2 * y)
y <- x <- seq(-100,100,1)
z <- outer(x, y, myfunc)
plot_ly(x = x, y = y, z = z) %>%
add_surface()
As you can see, given unlimited precision, there are an infinite number of points on that plane.

Generate a binary variable with a predefined correlation to an already existing variable

For a simulation study, I want to generate a set of random variables (both continuous and binary) that have predefined associations to an already existing binary variable, denoted here as x.
For this post, assume that x is generated following the code below. But remember: in real life, x is an already existing variable.
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
I want to generate both a binary variable and a continuous variable. I have figured out how to generate a continuous variable (see code below)
set.seed(1245)
cor <- 0.8 #Correlation
y <- rnorm(1000, cor*x, sqrt(1-cor^2))
But I can't find a way to generate a binary variable that is correlated to the already existing variable x. I found several R packages, such as copula which can generate random variables with a given dependency structure. However, they do not provide a possibility to generate variables with a set dependency on an already existing variable.
Does anyone know how to do this in an efficient way?
Thanks!
If we look at the formula for correlation:
For the new vector y, if we preserve the mean, the problem is easier to solve. That means we copy the vector x and try to flip a equal number of 1s and 0s to achieve the intended correlation value.
If we let E(X) = E(Y) = x_bar , and E(XY) = xy_bar, then for a given rho, we simplify the above to:
(xy_bar - x_bar^2) / (x_bar - x_bar^2) = rho
Solve and we get:
xy_bar = rho * x_bar + (1-rho)*x_bar^2
And we can derive a function to flip a number of 1s and 0s to get the result:
create_vector = function(x,rho){
n = length(x)
x_bar = mean(x)
xy_bar = rho * x_bar + (1-rho)*x_bar^2
toflip = sum(x == 1) - round(n * xy_bar)
y = x
y[sample(which(x==0),toflip)] = 1
y[sample(which(x==1),toflip)] = 0
return(y)
}
For your example it works:
set.seed(1245)
x <- rbinom(1000, 1, 0.6)
cor(x,create_vector(x,0.8))
[1] 0.7986037
There are some extreme combinations of intended rho and p where you might run into problems, for example:
set.seed(111)
res = lapply(1:1000,function(i){
this_rho = runif(1)
this_p = runif(1)
x = rbinom(1000,1,this_p)
data.frame(
intended_rho = this_rho,
p = this_p,
resulting_cor = cor(x,create_vector(x,this_rho))
)
})
res = do.call(rbind,res)
ggplot(res,aes(x=intended_rho,y=resulting_cor,col=p)) + geom_point()
Here's a binomial one - the formula for q only depends on the mean of x and the correlation you desire.
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- 1/((1-p)/cor^2+p)
y <- rbinom(100000, 1, q)
z <- x*y
cor(x,z)
#> [1] 0.7984781
This is not the only way to do this - note that mean(z) is always less than mean(x) in this construction.
The continuous variable is even less well defined - do you really not care about its mean/variance, or anything else about its distibution?
Here's another simple version where it flips the variable both ways:
set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- (1+cor/sqrt(1-(2*p-1)^2*(1-cor^2)))/2
y <- rbinom(100000, 1, q)
z <- x*y+(1-x)*(1-y)
cor(x,z)
#> [1] 0.8001219
mean(z)
#> [1] 0.57908

How to generate a probability density function and expectation in r?

The task:
Eric the fly has a friend, Ernie. Assume that the two flies sit at independent locations, uniformly distributed on the globe’s surface. Let D denote the Euclidean distance between Eric and Ernie (i.e., on a straight line through the interior of the globe).
Make a conjecture about the probability density function of D and give an
estimate of its expected value, E(D).
So far I have made a function to generate two points on the globe's surface, but I am unsure what to do next:
sample3d <- function(2)
{
df <- data.frame()
while(n > 0){
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1){
u <- sqrt(x^2+y^2+z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
E <- sample3d(2)
This is an interesting problem. I'll outline a computational approach; I'll leave the math up to you.
First we fix a random seed for reproducibility.
set.seed(2018);
We sample 10^4 points from the unit sphere surface.
sample3d <- function(n = 100) {
df <- data.frame();
while(n > 0) {
x <- runif(1,-1,1)
y <- runif(1,-1,1)
z <- runif(1,-1,1)
r <- x^2 + y^2 + z^2
if (r < 1) {
u <- sqrt(x^2 + y^2 + z^2)
vector = data.frame(x = x/u,y = y/u, z = z/u)
df <- rbind(vector,df)
n = n- 1
}
}
df
}
df <- sample3d(10^4);
Note that sample3d is not very efficient, but that's a different issue.
We now randomly sample 2 points from df, calculate the Euclidean distance between those two points (using dist), and repeat this procedure N = 10^4 times.
# Sample 2 points randomly from df, repeat N times
N <- 10^4;
dist <- replicate(N, dist(df[sample(1:nrow(df), 2), ]));
As pointed out by #JosephWood, the number N = 10^4 is somewhat arbitrary. We are using a bootstrap to derive the empirical distribution. For N -> infinity one can show that the empirical bootstrap distribution is the same as the (unknown) population distribution (Bootstrap theorem). The error term between empirical and population distribution is of the order 1/sqrt(N), so N = 10^4 should lead to an error around 1%.
We can plot the resulting probability distribution as a histogram:
# Let's plot the distribution
ggplot(data.frame(x = dist), aes(x)) + geom_histogram(bins = 50);
Finally, we can get empirical estimates for the mean and median.
# Mean
mean(dist);
#[1] 1.333021
# Median
median(dist);
#[1] 1.41602
These values are close to the theoretical values:
mean.th = 4/3
median.th = sqrt(2)

Adding two random variables via convolution in R

I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside the real world example I am working with requires addition to two random variables that are distributed according to a number of different distributions.
Does anyone know how to add two random variables by convoluting the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
EDIT
Thank you for these answers. I define a pdf, and try to do the convolution integral, but R complains on the integration step. My pdfs are Log Pearson 3 and are as follows
dlp3 <- function(x, a, b, g) {
p1 <- 1/(x*abs(b) * gamma(a))
p2 <- ((log(x)-g)/b)^(a-1)
p3 <- exp(-1* (log(x)-g) / b)
d <- p1 * p2 * p3
return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step since the f.t function is bounded and my integration limits are probably not correct. Any ideas on how to solve this?
Here is one way.
f.X <- function(x) dnorm(x,1,0.5) # normal (mu=1.5, sigma=0.5)
f.Y <- function(y) dlnorm(y,1.5, 0.75) # log-normal (mu=1.5, sigma=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # distribution function
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to #jlhoward's using Riemann approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
#Riemann approximation of convolution
conv <- function(t, a, b, d) { #a to b needs to cover the range of densities above. d needs to be small for accurate approx.
z <- NA
x <- seq(a, b, d)
for (i in 1:length(t)){
print(i)
z[i] <- sum(d1(x)*d2(t[i]-x)*d)
}
return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")

Trouble estimating E[u(X)] by simulation when u() is exponential

I'm trying to use R to estimate E[u(X)] where u is a utility function and X is a random variable. More specifically, I want to be able to rank E[u(X)] and E[u(Y)] for two random variables X and Y -- only the ranking matters.
My problem is that u(x) = -exp(-sigma * x) for some sigma > 0, and this converges very rapidly to zero. So I have many cases where I expect, say, E[u(X)] > E[u(Y)], but because they are so close to zero, my simulation cannot distinguish them.
Does anyone have any advice for me?
I am only interested in ranking the two expected utilities, so u(x) can be replaced by any u.tilde(x) = a * u(x) + b, where a > 0 and b can be any number.
Below is an example where X and Y are both normal (in which case I think there is a closed form solution, but pretend X and Y have complicated distributions that I can only simulate from).
get.u <- function(sigma=1) {
stopifnot(sigma > 0)
utility <- function(x) {
return(-exp(-sigma * x))
}
return(utility)
}
u <- get.u(sigma=1)
curve(u, from=0, to=10) # Converges very rapidly to zero
n <- 10^4
x <- rnorm(n, 10^4, sd=10)
y <- rnorm(n, 10^4, sd=10^3)
mean(u(x)) == mean(u(y)) # Returns True (they're both 0), but I expect E[u(x)] > E[u(y)]
## An example of replacing u with a*u + b
get.scaled.u <- function(sigma=1) {
stopifnot(sigma > 0) # Risk averse
utility <- function(x) {
return(-exp(-sigma * x + sigma * 10^4))
}
return(utility)
}
u <- get.scaled.u(sigma=1)
mean(u(x)) > mean(u(y)) # True as desired
x <- rnorm(n, 10^4, sd=10^3)
y <- rnorm(n, 10^4, sd=2*10^3)
mean(u(x)) > mean(u(y)) # False again -- they're both -Inf
Is finding a clever way to scale u the correct way to deal with this problem? For example, suppose X and Y both have bounded support -- if I know the bounds, how can I scale u to guarantee that a*u + b will be neither too close to -Inf, nor too close to zero?
Edit: I didn't know about multiple precision packages. Rmpfr is helpful:
library(Rmpfr)
x.precise <- mpfr(x, 100)
y.precise <- mpfr(y, 100)
mean(u(x.precise)) > mean(u(y.precise)) # True

Resources