Generating bivariate data where x variable is uniformly distributed between 0 and 1 and Y is normally distributed with mean 1/x with some noise - r

I used x <- c(runif(100, 0, 1)) to generate 100 x's between 0 and 1.
Now for each of the x's I am trying to generate 10 y's with mean 1/x and variance of 1.
Preferably the values would be stored in a matrix, so that if I plotted the 1000 points of y against x, it would look like the graph of y = 1/x plus some error.
Any help would be greatly appreciated.

If you want the data in a matrix, then you can do
x <- runif(100, 0, 1)
y <- sapply(x, function(m) rnorm(10, 1/m, 1))
This uses sapply to generate 10 normal values for each x value.
If you want a single two-column matrix instead, then maybe
points <- do.call("rbind", lapply(x, function(m) cbind(x=m, y=rnorm(10, 1/m, 1))))
is what you want. You can plot that with
plot(y~x, points)
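Because rnorm() is vectorised over its mean argument, the same 1000 points can also be drawn in a single call. The following is a minimal sketch rather than part of the original answer; the names n_x, n_rep, y_mat and pts are made up for illustration, and the seed is only there for reproducibility.

```r
set.seed(42)                      # assumption: fixed seed, only for reproducibility
n_x   <- 100                      # number of x values
n_rep <- 10                       # y draws per x value

x <- runif(n_x, 0, 1)

# Matrix form: sapply() binds each vector of 10 draws as a column,
# so y_mat has 10 rows and one column per x value.
y_mat <- sapply(x, function(m) rnorm(n_rep, mean = 1 / m, sd = 1))
dim(y_mat)

# Long form: repeat each x 10 times and let rnorm() take one mean per row.
pts   <- data.frame(x = rep(x, each = n_rep))
pts$y <- rnorm(nrow(pts), mean = 1 / pts$x, sd = 1)

# plot(y ~ x, data = pts)         # uncomment to see the y = 1/x shape
```

Note that column j of y_mat belongs to x[j], so the matrix is 10 x 100 and not 100 x 10; use t(y_mat) if one row per x value is preferred.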

Related

How can I calculate the second-order derivative of a vector using finite differences if the interval is non-constant?

Say I have vectors x and y and want to calculate the second derivative of y with respect to x using finite differences.
I'd do
x <- rnorm(2000)
y <- x^2
y = y[order(x)]
x = sort(x)
dydx = diff(y) / diff(x)
d2ydx2 = c(NA, NA, diff(dydx) / diff(x[-1]))
plot(x, d2ydx2)
As you can see, there are a few points which are wildly inaccurate. I believe the problem arises because the values in dydx do not exactly correspond to those of x[-1], which makes the second differentiation inaccurate. Since the step in x is non-constant, the second-order differentiation is not straightforward. How can I do this?
Each time you take a numerical approximation of the derivative, you lose one value in the vector and shift the output over by one position. You are correct: the error is due to the uneven spacing of the x values (an incorrect divisor in the dydx and d2ydx2 calculations).
In order to correct, calculate a new set of x values corresponding to the mid point between the adjacent x values at each derivative. This is the value where the slope is calculated.
Thus y'1 = f'((x1+x2)/2).
This method is not perfect but the resulting error is much smaller.
# create the input
x <- sort(rnorm(2000))
y <- x^2
# calculate the first derivative and the midpoint x values it belongs to
xprime <- x[-1] - diff(x)/2
dydx <- diff(y)/diff(x)
# calculate the second derivative and the midpoints of xprime it belongs to
xpprime <- xprime[-1] - diff(xprime)/2
d2ydx2 <- diff(dydx)/diff(xprime)
plot(xpprime, d2ydx2)
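The midpoint idea above can be wrapped in a small helper and checked on a grid where the answer is known. This is only a sketch: second_deriv_midpoint is a hypothetical name, and the six-point uneven grid is made up for illustration. For y = x^2 each difference quotient equals twice the midpoint, so the second derivative should come out as 2 up to rounding.

```r
# Hypothetical helper: second derivative by differencing twice,
# placing each difference quotient at the midpoint of its interval.
second_deriv_midpoint <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 3,
            !is.unsorted(x, strictly = TRUE))
  midpoints <- function(v) (v[-1] + v[-length(v)]) / 2

  x1   <- midpoints(x)            # where the first derivative lives
  dydx <- diff(y) / diff(x)
  x2   <- midpoints(x1)           # where the second derivative lives
  d2   <- diff(dydx) / diff(x1)

  data.frame(x = x2, d2ydx2 = d2)
}

# Assumed test grid with deliberately uneven spacing.
x   <- c(0, 0.1, 0.4, 0.6, 1.1, 1.5)
res <- second_deriv_midpoint(x, x^2)
res
```

With randomly drawn x values some neighbours can be extremely close together, so diff(x) becomes tiny and rounding error is amplified; that is one reason a few points may still look noisy on real data.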
Another way is using splinefun, which returns a function from which you can calculate cubic spline derivatives.
Of course, given your example function y = x^2, the second derivative will always be 2.
x <- rnorm(2000)
y <- x^2
y = y[order(x)]
x = sort(x)
fun = splinefun(x,y)
plot(x,fun(x,deriv=2))

Unable to find outside of range value using R tool

1) Generate 500 random numbers between 0 and 100.
2) Find the sum of these 500 random numbers.
3) Repeat steps 1) and 2) above 1000 times, generating a new set of random numbers each time.
4) Letting Y denote the sum of the 500 numbers, obtain a box-and-whisker plot of the random variable Y.
5) Display the values of Y which are outside mean +/- 2*SD, where SD is the standard deviation.
6) Which statistical distribution is justified for the random variable Y?
My attempt so far:
y <- runif(500, min = 1, max = 100) # 1
sum(y) # 2
c <- runif(1000, min = 1, max = 100) # 3
sum(c) # 4
The above is the answer I managed to figure out, but I am not sure whether it is correct.
Please help me out.
This seems to be a homework task, but let me try to point you in the right direction.
Steps 1 to 3 create the sums of random variables. Since no distribution is given, we assume a uniform distribution.
Y <- numeric(0) # sums are stored here
for (i in 1:1000) {
Y[i] <- sum(runif(500, min=0, max=100))
}
So Y contains 1000 sums of 500 uniformly distributed random variables.
There is another way to create this Y:
Y <- sapply(1:1000, function(x) sum(runif(500, min=0, max=100)))
For steps 4 to 6 I assume you will take a look at the R help for box plots (steps 4 and 5) and histograms (step 6). Try ?boxplot and ?hist.
Y <- replicate(1000, sum(runif(500, min=0, max=100)))
min_val = mean(Y) - 2*sd(Y)
max_val = mean(Y) + 2*sd(Y)
Y_min <- Y[Y < min_val]
Y_max <- Y[Y > max_val]
boxplot(Y, range=1)
points(rep(1,length(Y_min)), Y_min, pch=23, col="red")
points(rep(1,length(Y_max)), Y_max, pch=23, col="blue")
You get an answer for step 6 if you understand the mathematics. Perhaps the central limit theorem gives you some insight.
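As a rough check of the central limit theorem hint, the simulated sums can be compared with the values the theory predicts. This is a sketch under the same uniform(0, 100) assumption used above; the names n_terms, n_rep, theo_mean and theo_sd are introduced here for illustration. A single uniform(0, 100) draw has mean 50 and variance 100^2/12, so a sum of 500 independent draws has mean 25000 and standard deviation sqrt(500 * 100^2 / 12), about 645.5.

```r
set.seed(1)                       # assumption: fixed seed, only for reproducibility
n_terms <- 500                    # numbers per sum
n_rep   <- 1000                   # number of sums

Y <- replicate(n_rep, sum(runif(n_terms, min = 0, max = 100)))

# Theoretical moments of a sum of n_terms independent uniform(0, 100) draws.
theo_mean <- n_terms * 50
theo_sd   <- sqrt(n_terms * 100^2 / 12)

c(sample_mean = mean(Y), theory_mean = theo_mean)
c(sample_sd   = sd(Y),   theory_sd   = theo_sd)

# Step 5: values further than 2 SD from the mean.
outside      <- Y[abs(Y - mean(Y)) > 2 * sd(Y)]
frac_outside <- length(outside) / n_rep
frac_outside                      # a normal distribution puts about 4.6% out here
```

If the sample moments and the fraction outside 2 SD are close to the normal-theory values, that supports a normal approximation for Y; qqnorm(Y) is another quick visual check.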

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like to have much more widely spread data, so that I could simply transform it by adding 50 and get a range quite similar to that of my first variable.
Does anyone know a way to generate this kind of more or less meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside that range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
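The noise level found by parameter search above can also be approximated in closed form. If B = c - b*A + e with independent noise e of standard deviation s, then cor(A, B) = -b*sd(A) / sqrt(b^2*var(A) + s^2), and solving for s gives s = b * sd(A) * sqrt(1/r^2 - 1). The sketch below applies this; slope, rho and noise_sd are names introduced here for illustration, and the result is only approximate for a finite sample.

```r
set.seed(1)                       # assumption: fixed seed, only for reproducibility
n     <- 10000
rho   <- -0.78                    # target correlation (negative)
slope <- 1e3                      # scaling applied to A, as in the answer above

A <- rnorm(n, 50, 10)

# Noise standard deviation that gives |cor| = |rho| in expectation.
noise_sd <- slope * sd(A) * sqrt(1 / rho^2 - 1)
noise_sd

B <- 1e5 - slope * A + rnorm(n, 0, noise_sd)
cor(A, B)                         # close to rho, up to sampling noise
range(B)                          # still worth checking against 0 to 100,000
```

With sd(A) near 10 this gives a noise standard deviation of roughly 8,000, which is in line with the 8.15e3 found by search.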
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by #gregor-de-cillia.
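The reason a linear transformation is safe is that Pearson correlation does not change when a variable is multiplied by a positive constant and shifted. A small sketch of that property follows; rescale_to_range is a hypothetical helper, and var2 here is just a stand-in for the small-valued variable produced by the method in the question.

```r
set.seed(2)                       # assumption: fixed seed, only for reproducibility
var1 <- rnorm(100, 50, 10)

# Stand-in for the small-valued, negatively correlated var2 from the question.
var2 <- -0.78 * as.numeric(scale(var1)) + rnorm(100, 0, 0.6)

# Hypothetical helper: map v linearly so that min(v) -> lo and max(v) -> hi.
rescale_to_range <- function(v, lo, hi) {
  lo + (v - min(v)) / (max(v) - min(v)) * (hi - lo)
}

var2_scaled <- rescale_to_range(var2, 0, 1e5)

cor(var1, var2)                   # correlation before rescaling
cor(var1, var2_scaled)            # same value after rescaling
range(var2_scaled)
```

Mapping the sample minimum and maximum onto the range guarantees that every value lies inside it, at the cost of tying the scale to the two most extreme points of the sample.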

Extract approximate probability density function (pdf) in R from random sampling

I have n>2 independent continuous random variables (RVs). For example, say I have 4 uniform RVs with different upper and lower bounds:
W~U[-1,5], X~U[0,1], Y~U[0,2], Z~U[0.5,2]
I am trying to find the approximate PDF of the sum of these RVs, i.e. of T=W+X+Y+Z. As I don't need a closed-form solution, I have sampled 1 million points from each of them to get 1 million samples of T. Is it possible in R to get the approximate PDF, or a way to get the approximate probability P(t<T), from the samples I have drawn? For example, is there an easy way to calculate P(0.5<T) in R? My priority is to get the probability first, even if getting the density function is not possible.
Thanks
Consider the ecdf function:
set.seed(123)
W <- runif(1e6, -1, 5)
X <- runif(1e6, 0, 1)
Y <- runif(1e6, 0, 2)
Z <- runif(1e6, 0.5, 2)
T <- Reduce(`+`, list(W, X, Y, Z))
cdfT <- ecdf(T)
1 - cdfT(0.5) # Pr(T > 0.5)
# [1] 0.997589
See How to calculate cumulative distribution in R? for more details.
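For the density part of the question, one option (not part of the answer above) is a kernel density estimate from density(), turned into a callable function with approxfun(). The sketch below uses 1e5 samples to keep it quick; T_samples, pdfT, p_gt and area are names introduced here. For this particular example the probability can also be checked by hand: after shifting each variable to start at 0, the event T < 0.5 is a simplex of volume 1/24 inside a box of volume 18, so P(T > 0.5) = 1 - 1/432, about 0.9977.

```r
set.seed(123)                     # assumption: fixed seed, only for reproducibility
n <- 1e5                          # fewer samples than the question, for speed

T_samples <- runif(n, -1, 5) + runif(n, 0, 1) + runif(n, 0, 2) + runif(n, 0.5, 2)

# Empirical CDF gives tail probabilities directly.
cdfT <- ecdf(T_samples)
p_gt <- 1 - cdfT(0.5)             # estimate of P(T > 0.5)
p_gt

# Kernel density estimate, wrapped as a function that returns 0 outside the grid.
dens <- density(T_samples, n = 1024)
pdfT <- approxfun(dens$x, dens$y, yleft = 0, yright = 0)
pdfT(c(0, 4, 8))                  # approximate density at a few points

# Sanity check: the estimated density should integrate to roughly 1.
area <- sum(dens$y) * diff(dens$x[1:2])
area
```

The kernel estimate smooths the sharp edges of the true density at -0.5 and 10, so the ecdf remains the better tool for probabilities; the density function is mainly useful for plotting and rough comparisons.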

Computing Euclidean Distance whilst holding point A constant and changing point B in R

I am currently working on a project for which I am interested in calculating the distance between the location of a basketball player and the ball during an event.
To do this I created the following function:
## Euclidean distance
distance <- function(x,y){
x2 <- (x[i]-x[j])^2
y2 <- (y[i]-y[j])^2
dis <- sqrt(x2+y2)
}
What I want to achieve is to calculate the distance between the basketball and the players, and then repeat this process for each time frame of data I have. So for each time frame x1 and y1 would have to be constant whilst x[j] and y[j] would keep going from 2 to 11. I thought of this nested for loop, but it is giving me a constant result of 28.34639. I added a link to an image of a sample of my data frame. Data Frame Sample
for(i in i:length(all.movement$x_loc)){
for(j in j:11){
all.movement$distance[j] <- distance(all.movement$x_loc, all.movement$y_loc)
}
i <- i + 11
}
I would really appreciate some help with this problem.
I'd go about:
set.seed(101)
x <- rnorm(30, 10, 5) # x coordinate
y <- rnorm(30, 15, 7) # y coordinate
df <- data.frame(x, y) # sample data.frame
# assume basket coordinates (5, -4); the arithmetic is vectorised, so no loop is needed
df$distance <- sqrt((df$x - 5)^2 + (df$y + 4)^2)
df # output
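To measure against the ball rather than a fixed point, the reference coordinates have to change with each time frame. The sketch below assumes a layout inferred from the question, not confirmed by it: rows come in blocks of 11 per frame, with the ball in the first row of each block and the ten players after it. The frame column, the movement data frame and the helper first_in_frame are made up for illustration.

```r
set.seed(101)                     # assumption: fixed seed, only for reproducibility
n_frames <- 3

# Made-up data: 11 rows per frame, row 1 of each frame is the ball.
movement <- data.frame(
  frame = rep(seq_len(n_frames), each = 11),
  x_loc = runif(n_frames * 11, 0, 94),
  y_loc = runif(n_frames * 11, 0, 50)
)

# Repeat the first value of each group over the whole group.
first_in_frame <- function(v) rep(v[1], length(v))

# ave() applies the function within each frame and keeps the row order.
ball_x <- ave(movement$x_loc, movement$frame, FUN = first_in_frame)
ball_y <- ave(movement$y_loc, movement$frame, FUN = first_in_frame)

# Vectorised Euclidean distance from every row to the ball of its own frame.
movement$distance <- sqrt((movement$x_loc - ball_x)^2 +
                          (movement$y_loc - ball_y)^2)

head(movement, 11)                # first frame: the ball row has distance 0
```

If the real data identifies the ball by an id column instead of by position, the same approach works after replacing first_in_frame with a lookup of that row.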
