Generate Random Numbers with Std Dev x and Fixed Product - r

I want to generate a series of returns x such that the standard deviation of the returns is, say, 0.03 and the product of 1+x is 1. To summarise, there are two conditions for the returns:
1) sd(x) == 0.03
2) prod(1+x) == 1
Is this possible and if so, how can I implement it in R?
Thank you.

A slightly more sophisticated approach is to use knowledge of the log-normal distribution: from ?dlnorm, Var = exp(2*mu + sigma^2)*(exp(sigma^2) - 1). We want the geometric mean to equal 1, so the mean on the log scale (mu) should be 0. That leaves Var = exp(sigma^2)*(exp(sigma^2)-1); rather than solving for sigma^2 by hand, we can use uniroot:
Find the correct log-variance:
vfun <- function(s2,v=0.03^2) { exp(s2)*(exp(s2)-1)-v }
s2 <- uniroot(vfun,interval=c(1e-6,100))$root
Generate values:
set.seed(1001)
x <- rnorm(1000,mean=0,sd=sqrt(s2))
x <- exp(x-mean(x))-1 ## centre on the log scale, so sum(log(1+x)) is zero
prod(1+x) ## 1, up to floating-point error
sd(x)
This produces values with a standard deviation not exactly equal to 0.03, but close. If we wanted we could fix this too ...
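As a cross-check on the uniroot step: the variance equation is a quadratic in y = exp(sigma^2), namely y^2 - y - v = 0, whose positive root is y = (1 + sqrt(1 + 4v))/2. The sketch below compares that closed form with the numerical root. The names s2_closed and s2_root are my own, and the tighter tol is an assumption on my part: uniroot's default tolerance (.Machine$double.eps^0.25, roughly 1e-4) is coarse relative to a root of about 9e-4.

```r
vfun <- function(s2, v = 0.03^2) exp(s2) * (exp(s2) - 1) - v

# Closed form: positive root of y^2 - y - v = 0, then s2 = log(y)
v <- 0.03^2
s2_closed <- log((1 + sqrt(1 + 4 * v)) / 2)

# Numerical root, with a tighter tolerance than the default
s2_root <- uniroot(vfun, interval = c(1e-6, 100), tol = 1e-12)$root

c(closed = s2_closed, uniroot = s2_root)
```

If the two values agree, either can be passed on as sd = sqrt(s2) in the rnorm call.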

A very simple approach is to simulate returns until you have a set that satisfies your requirements. You will need to specify a tolerance for your requirements, though (see here why).
nn <- 10
epsilon <- 1e-3
while ( TRUE ) {
  xx <- rnorm(nn,0,0.03)
  # && is the scalar (short-circuit) operator that if() expects
  if ( abs(sd(xx)-0.03)<epsilon && abs(prod(1+xx)-1)<epsilon ) break
}
xx
yields
[1] 0.007862226 -0.011437600 -0.038740969 0.028614022 0.006986953
[6] -0.004131429 0.030846398 -0.037977057 0.046448318 -0.025294236

Related

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like to have much more widely spread data, so that I could simply transform it by adding 50 and have a range quite similar to that of my first variable.
Does anyone know a way to generate this kind of (more or less) meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This only results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
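A linear transformation with a positive slope leaves the Pearson correlation unchanged, which is why the rescaling is safe. Here is a minimal sketch of that point. The helper rescale_to_range is hypothetical (not from any package), and var2 is only a stand-in for the small-valued variable from the question.

```r
set.seed(1)
var1 <- rnorm(100, 50, 10)
# Stand-in for the small-valued, negatively correlated var2
var2 <- -0.78 * as.numeric(scale(var1)) + rnorm(100, sd = 0.6)

# Hypothetical helper: map x linearly onto [lo, hi]
rescale_to_range <- function(x, lo, hi) {
  lo + (x - min(x)) / (max(x) - min(x)) * (hi - lo)
}

var2_scaled <- rescale_to_range(var2, 0, 1e5)

cor(var1, var2)          # correlation before rescaling
cor(var1, var2_scaled)   # same value after rescaling
range(var2_scaled)       # 0 and 1e5
```

Mapping min and max onto the target range pins the extremes exactly, whereas multiplying by a fixed constant keeps the spread tied to the original values.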

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but could not find precisely what I am looking for).
How can I draw one element of a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,0,10)) # lower bound 0, so that 1 can also be drawn
How do I similarly sample from e.g. the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or a logistic probability function (if I want a sigmoid increasing probability from 1 through 10).
The only "solution" I can come up with is first to draw 1e6 numbers from, say, the sigmoid distribution and then scale min and max to 1 and 10, but this looks clumsy.
UPDATE:
This awkward solution (which I don't feel is very "correct") would go like this:
#Draw enough from a distribution, here exponential
x <- rexp(1e3)
#Scale probs to e.g. 1-10
scaler <- function(vector, min, max){
  (((vector - min(vector)) * (max - min))/(max(vector) - min(vector))) + min
}
x_scale <- scaler(x,1,10)
#And sample once (and round it)
round(sample(x_scale,1))
Are there not better solutions around ?
I believe sample() is what you are looking for, as @HubertL mentioned in the comments. You can specify an increasing function (e.g. logit() below, which despite its name computes the logistic function) and pass the vector you want to sample from, v, as its input. You can then use the output of that function as a vector of probabilities p; sample() rescales the weights, so they do not need to sum to 1. See the code below.
logit <- function(x) {
  # note: this is the logistic (inverse-logit) function, equivalent to plogis(x)
  return(exp(x)/(exp(x)+1))
}
v <- seq(1,10,1)
p <- logit(v)
sample(v, 1, prob = p, replace = TRUE)
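The same pattern covers the exponential case from the question. This is a sketch only: the rate of 0.5 and the name p_exp are arbitrary choices of mine, and the weights are simply the exponential density evaluated at 1 to 10.

```r
v <- 1:10
# Unnormalised weights from the exponential density;
# sample() rescales them internally so they sum to 1
p_exp <- dexp(v, rate = 0.5)   # rate = 0.5 is an arbitrary choice

set.seed(42)
draws <- sample(v, 1e4, replace = TRUE, prob = p_exp)
table(draws)   # 1 is drawn far more often than 10
```

A larger rate makes the weights fall off faster, so small values dominate even more.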

R optimize not giving the finite minimum but Inf when the search interval is wider

I have a problem with optimize().
When I limit the search in a small interval around zero, e.g., (-1, 1), the optimize algorithm gives a finite minimum with a finite objective function value.
But when I make the interval wider to (-10, 10), then the minimum is on the boundary of the interval and the objective is Inf, which is really puzzling for me.
How can this happen and how to fix this? Thanks a lot in advance.
The following is my code.
set.seed(123)
n <- 120
c <- rnorm(n,mean=1,sd=.3);
eps <- rnorm(n,mean=0,sd=5)
tet <- 32
r <- eps * c^tet
x <- matrix(c(c,r), ncol=2)
g <- function(tet, x){
  matrix((x[,1]^(-tet))*x[,2],ncol=1)
}
theta <- 37
g_t <- g(theta,x)
f.tau <- function(tau){
  exp.tau.g <- exp(g_t %*% tau)
  g.exp <- NULL; i <- 1:n
  g.exp <- matrix(exp.tau.g[i,] * g_t[i,], ncol=1)
  sum.g.exp <- apply(g.exp,2,sum)
  v <- t(sum.g.exp) %*% sum.g.exp
  return(v)
}
band.tau <- 1;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-1, 1)"); print(f);
band.tau <- 10;
f <- optimize(f.tau, c(-band.tau, band.tau), tol=1e-20)
print("interval=(-10, 10)"); print(f);
The problem is that your function f.tau(x) is not well behaved. You can see that here:
vect.f <- Vectorize(f.tau)
z1 <- seq(-1,1,by=0.01)
z10 <- seq(-10,10,by=0.01)
par(mfrow=c(2,1), mar=c(2,2,1,1))
plot(z1, log(vect.f(z1)), type="l")
plot(z10,log(vect.f(z10)),type="l")
Note that these are plots of log(f.tau). So there are two problems: f.tau(...) has an extremely large slope on either side of the minimum, and f.tau = Inf for x < -0.6 and x > 1.0, where Inf means that f.tau(...) is greater than the largest number that can be represented on this system. When you set the range to (-1,1) your starting point is close enough to the minimum that optimize(...) manages to converge. When you set the limits to (-10,10) the starting point is too far away. There are examples in the documentation which show a similar problem with functions that are not nearly as ill-behaved as f.tau.
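To make the "computational infinity" concrete: f.tau exponentiates g_t %*% tau, and exp() overflows to Inf once its argument passes roughly 709.78 in double precision. A minimal illustration, independent of the data above:

```r
.Machine$double.xmax          # largest finite double, about 1.8e308
exp(709)                      # still finite
exp(710)                      # Inf: the result exceeds double.xmax
is.finite(exp(c(709, 710)))   # TRUE FALSE
```

So any tau for which a single entry of g_t * tau exceeds that threshold makes the whole objective Inf.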
EDIT (Response to OP's comment)
The main problem is that you are trying to optimize a function which has computational infinities in the interval of interest. Here's a way around that.
band.tau <- 10
z <- seq(-band.tau,band.tau,length=1000)
vect.f <- Vectorize(f.tau)
interval <- range(z[is.finite(vect.f(z))])
f <- optimize(f.tau, interval, tol=1e-20)
f
# $minimum
# [1] 0.001615433
#
# $objective
# [,1]
# [1,] 7.157212e-12
This evaluates f.tau(x) at 1000 equally spaced points on (-band.tau,+band.tau), identifies all the values of x where f.tau is finite, and uses their range as the search interval in optimize(...). This works in your case because f.tau(x) does not (appear to...) have asymptotes.

Efficiently generating discrete random numbers

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector cdf (an increasing vector starting at 0 and ending at 1)
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
...
Example
First generate a CDF:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
##+ 1 because the count of CDF entries below u is one less than the chosen index
colSums(sapply(u, ">", cdf)) + 1
If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75
If you have a finite number of possible values then you can use findInterval or cut or, better, sample as mentioned by @Hemmo.
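For completeness, a sketch of the findInterval route, which implements the "first index whose CDF value exceeds u" rule directly. The variable names idx and brute are mine, and the brute-force comparison is only there as a sanity check on a small subset.

```r
set.seed(1)
cdf <- cumsum(runif(10000, 0, 0.1))
cdf <- cdf / max(cdf)

u <- runif(1e6)
# findInterval(u, cdf) counts the CDF entries <= u, so adding 1 gives the
# first index whose CDF value exceeds u (1 when u < cdf[1])
idx <- findInterval(u, cdf) + 1L

# Cross-check against the brute-force comparison on a small subset
brute <- colSums(sapply(u[1:100], ">=", cdf)) + 1
all(idx[1:100] == brute)
```

findInterval uses a binary search on the sorted cdf vector, so it avoids building the large comparison matrix that sapply creates.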
However, if you want to generate data from a distribution that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.) then here is an algorithm that will work (this will also work with a finite number of values if wanted):
Start with your vector of uniform values and loop through the distribution values, subtracting them from the vector of uniforms; the random value is the iteration at which the value goes negative. This is easier to see with an example. This generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares the result to using the inverse CDF (which is more efficient in this case, where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp) )
while( any(tmp > 0) ) {
  tmp <- tmp - dpois(i, 5)
  randvals <- randvals + (tmp > 0)
  i <- i + 1
}
randvals2 <- qpois( tmp2, 5 )
all.equal(randvals, randvals2)

Generate numbers in R

In R, how can I generate N numbers that have a mean of X and a median of Y (or at least close to those values)?
Or perhaps more generally, is there an algorithm for this?
There is an infinite number of solutions.
Approximate algorithm:
Generate n/2 numbers below the median
Generate n/2 numbers above the median
Add your desired median and check
Add one number with enough weight to satisfy your mean -- which you can solve
Example assuming you want a median of zero and a mean of twenty:
R> set.seed(42)
R> lo <- rnorm(10, -10); hi <- rnorm(10, 10)
R> median(c(lo,0,hi))
[1] 0 # this meets our first criterion
R> 22*20 - sum(c(lo,0,hi)) # (n+1)*desiredMean - currentSum
[1] 436.162 # so if we insert this, we get the right answer
R> mean(c(lo,0,hi,22*20 - sum(c(lo,0,hi))))
[1] 20 # so we meet criterion two
R>
because desiredMean times (n+1) has to be equal to sum(currentSet) + x so we solve for x getting the expression above.
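One caveat with appending a single value: the sample size goes from odd to even, so the median of the final set becomes the average of 0 and the smallest value above it, and is no longer exactly 0. A variation that keeps the median fixed is to add two values, one on each side. This is a sketch of that idea rather than the method above; low_extra and its value of -10 are arbitrary assumptions.

```r
set.seed(42)
lo <- rnorm(10, -10); hi <- rnorm(10, 10)
base <- c(lo, 0, hi)             # 21 values with median 0
target_mean <- 20

low_extra <- -10                 # arbitrary extra value below the median
n_final <- length(base) + 2      # two values are appended
# Solve n_final * target_mean = sum(base) + low_extra + high_extra
high_extra <- n_final * target_mean - sum(base) - low_extra

x <- c(base, low_extra, high_extra)
median(x)   # still 0: 11 values below, 11 above
mean(x)     # 20
```

The price is one very large value, so the result is heavily skewed; that is inherent in asking for a mean far from the median.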
For a set of data that looks fairly 'normal', you can use the correction factor method as outlined by @Dirk-Eddelbuettel, but with your custom values used to generate a set of data around your mean:
X = 25 # in this snippet X is the centre the set is mirrored around (approximately the median of the result)
Y = 25.5 # ... and Y is the value the correction forces the mean to
N = 100
set.sd = 5 # if you want to set the standard deviation of the set.
set <- rnorm(N, Y, set.sd) # generate a set around the mean
set.left <- set[set < X] # take only the values to the left of X
set <- c(set.left, X + (X - set.left)) # ... and mirror them to the right of X.
# redefine the set, adding in the correction number and an extra number on the opposite side to the correction:
set <- c(set,
         X + ((set.sd / 2) * sign(X - Y)),
         ((length(set)+ 2) * Y)
         - sum(set, X + ((set.sd / 2) * sign(X - Y)))
)
Take strong heed of the first answer's first sentence. Unless you know what underlying distribution you want, you can't do it. Once you know that distribution, there are R functions for many standard ones, such as runif, rnorm and rchisq. You can create an arbitrary distribution with the sample function.
If you are okay with the restriction X > Y > 0 (mean above median), then you can fit a lognormal distribution. The lognormal conveniently has closed forms for both the mean, exp(mu + sigma^2/2), and the median, exp(mu).
rmm <- function(n, X, Y) rlnorm(n, log(Y), sqrt(2*log(X/Y)))
E.g.:
z <- rmm(10000, 3, 1)
mean(z)
# [1] 2.866567
median(z)
# [1] 0.9963516
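The parameter choice in rmm can be checked without simulation. With meanlog mu = log(Y) and sdlog sigma = sqrt(2*log(X/Y)), the lognormal median is exp(mu) = Y and the mean is exp(mu + sigma^2/2) = X. A small sketch (the variable names are mine):

```r
X <- 3; Y <- 1                      # target mean and median, as in the example
mu    <- log(Y)
sigma <- sqrt(2 * log(X / Y))       # real only when X > Y > 0

exp(mu + sigma^2 / 2)               # theoretical mean
qlnorm(0.5, meanlog = mu, sdlog = sigma)   # theoretical median
```

The simulated mean of 2.87 differs from 3 only through sampling noise, which is sizeable here because sigma is large.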

Resources