Generate numbers in R

In R, how can I generate N numbers that have a mean of X and a median of Y (or at least come close to them)?
Or perhaps more generally, is there an algorithm for this?

There is an infinite number of solutions.
Approximate algorithm:
Generate n/2 numbers below the median
Generate n/2 numbers above the median
Add your desired median and check
Add one number with enough weight to satisfy your mean -- which you can solve for
Example assuming you want a median of zero and a mean of twenty:
R> set.seed(42)
R> lo <- rnorm(10, -10); hi <- rnorm(10, 10)
R> median(c(lo,0,hi))
[1] 0 # this meets our first criterion
R> 22*20 - sum(c(lo,0,hi)) # (n+1)*desiredMean - currentSum
[1] 436.162 # so if we insert this, we get the right answer
R> mean(c(lo,0,hi,22*20 - sum(c(lo,0,hi))))
[1] 20 # so we meet criterion two
R>
This works because desiredMean * (n+1) has to equal sum(currentSet) + x, so solving for x gives the expression above.
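If it helps, here is a small helper (my own sketch, not part of the original answer; the function name and arguments are mine) that wraps the same recipe for arbitrary n, target mean and target median:
# Generate (n-2)/2 values below and above the target median, insert the median
# itself, then append one balancing value so the overall mean is exact.
# As noted above, the algorithm is approximate: the median is exact before the
# balancing value is appended and only approximate afterwards.
gen_mean_median <- function(n, target_mean, target_median, spread = 10) {
  k  <- floor((n - 2) / 2)
  lo <- rnorm(k, target_median - spread)
  hi <- rnorm(k, target_median + spread)
  x  <- c(lo, target_median, hi)
  c(x, n * target_mean - sum(x))   # balancing value: n*mean - current sum
}
set.seed(42)
x <- gen_mean_median(22, 20, 0)
mean(x)    # exactly 20
median(x)  # only approximately 0 once the balancing value is included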

For a set of data that looks fairly 'normal', you can use the correction-factor method as outlined by @Dirk-Eddelbuettel, but with your custom values used to generate a set of data around your mean:
X = 25
Y = 25.5
N = 100
set.sd = 5 # if you want to set the standard deviation of the set.
set <- rnorm(N, Y, set.sd) # generate a set around the mean
set.left <- set[set < X] # take only the left half
set <- c(set.left, X + (X - set.left)) # ... and make a copy on the right.
# redefine the set, adding in the correction number and an extra number on the opposite side to the correction:
set <- c(set,
         X + ((set.sd / 2) * sign(X - Y)),
         ((length(set) + 2) * Y) -
           sum(set, X + ((set.sd / 2) * sign(X - Y))))
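A quick sanity check I would add at the end (not part of the original recipe) is simply to compare the realised statistics with the targets:
mean(set)    # how close did we get to the target mean?
median(set)  # ... and to the target median?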

Take strong heed of the first answer's first sentence: unless you know what underlying distribution you want, you can't do it. Once you know that distribution, there are R functions for many standard ones such as runif, rnorm, and rchisq. You can create an arbitrary discrete distribution with the sample function.
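For instance, a small sketch of an arbitrary discrete distribution built with sample (the values and weights below are made up purely for illustration):
vals  <- c(1, 2, 5, 10)
probs <- c(0.4, 0.3, 0.2, 0.1)   # must sum to 1
x <- sample(vals, size = 1000, replace = TRUE, prob = probs)
mean(x)    # close to sum(vals * probs) = 3
median(x)  # 2, since the cumulative weight passes 0.5 at the value 2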

If you are okay with the restriction Y < X (median less than mean), then you can fit a lognormal distribution. The lognormal conveniently has closed forms for both mean and median.
rmm <- function(n, X, Y) rlnorm(n, log(Y), sqrt(2*log(X/Y)))
E.g.:
z <- rmm(10000, 3, 1)
mean(z)
# [1] 2.866567
median(z)
# [1] 0.9963516
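This works because a lognormal with parameters meanlog and sdlog has median exp(meanlog) and mean exp(meanlog + sdlog^2/2), so the parameters used by rmm recover the targets exactly. A quick check with the X = 3, Y = 1 example above:
X <- 3; Y <- 1
meanlog <- log(Y)
sdlog   <- sqrt(2 * log(X / Y))
exp(meanlog)                 # theoretical median = Y = 1
exp(meanlog + sdlog^2 / 2)   # theoretical mean   = X = 3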

Related

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R and we're studying paired t-tests. I have come across questions where we're given two sets of data and asked to test whether the difference in means is equal to 0, greater than 0, and so on. The function we use for two samples x and y with unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test that the difference in means is greater than or equal to a specified number, against the alternative that it is less than that number rather than 0?
For example, say we're given the before and after weights of 10 people. How do we test that the mean difference in weight is at least 3 kg against the alternative that it is less than 3 kg? Is there a way to do this? I would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you need more theoretical backing. Is it OK to add/subtract the 3 kg from either x or y and then use the t-test to check for similarity? I think this would at least tell you which outcome is more likely, if that's the end goal. It would be good to get feedback on this.
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
# [1] 0.8279801
t_dec$p.value
# [1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3 assuming your x variables are in the units you describe above.
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange [here](https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317).
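Applied to the simulated data above, a sketch of the test the question describes (null hypothesis: the mean weight change is at least 3 kg; alternative: it is less than 3 kg). Note the argument order, so the differences are taken as y_act - x:
# H0: mean(y_act - x) >= 3   vs   H1: mean(y_act - x) < 3
t.test(y_act, x, mu = 3, paired = TRUE, alternative = "less")
# equivalently, as a one-sample t-test on the paired differences:
t.test(y_act - x, mu = 3, alternative = "less")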

Unable to find outside of range value using R tool

1) Generate 500 random numbers between 0 and 100.
2) Find the sum of these 500 random numbers.
3) Repeat steps 1) and 2) above 1000 times, generating a new set of random numbers each time.
4) Letting Y denote the sum of the 500 numbers, obtain a box-whisker plot of the random variable Y.
5) Display the values of Y which fall outside mean +/- 2*SD, where SD is the standard deviation.
6) Which statistical distribution is justified for the random variable Y?
For steps 1 to 4, this is what I have:
y <- runif(500, min = 1, max = 100) # 1
sum(y) # 2
c <- runif(1000, min = 1, max = 100) # 3
sum(c) # 4
I managed to figure out the above, but I'm not sure whether it is correct or not.
Please help me out.
This seems to be a homework task, but let me try to point you in the right direction.
Steps 1 to 3 create the sums of random variables. Since no distribution is given, we assume a uniform distribution.
Y <- numeric(0) # sums are stored here
for (i in 1:1000) {
  Y[i] <- sum(runif(500, min=0, max=100))
}
So Y contains 1000 sums of 500 uniformly distributed random variables.
There is another way to create this Y:
Y <- sapply(1:1000, function(x) sum(runif(500, min=0, max=100)))
For steps 4 to 6, take a look at the R help for box plots (steps 4/5) and histograms (step 6). Try ?boxplot and ?hist.
Y <- replicate(1000, sum(runif(500, min=0, max=100)))
min_val = mean(Y) - 2*sd(Y)
max_val = mean(Y) + 2*sd(Y)
Y_min <- Y[Y < min_val]
Y_max <- Y[Y > max_val]
boxplot(Y, range=1)
points(rep(1,length(Y_min)), Y_min, pch=23, col="red")
points(rep(1,length(Y_max)), Y_max, pch=23, col="blue")
You will get an answer for step 6 if you understand the mathematics. Perhaps the central limit theorem gives you some insight.
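As a hint for step 6, a quick sketch (my addition, not from the answer above): overlay on the histogram the normal density suggested by the central limit theorem, using the mean and variance of a uniform(0, 100) variable (mean 50, variance 100^2/12):
Y <- replicate(1000, sum(runif(500, min = 0, max = 100)))
hist(Y, freq = FALSE, breaks = 30)
curve(dnorm(x, mean = 500 * 50, sd = sqrt(500 * 100^2 / 12)),
      add = TRUE, col = "red", lwd = 2)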

How to generate samples from MVN model?

I am trying to run some code in R based on this paper, working through example 5.1. I want to simulate the following:
My background in R isn't great, so I have the code below. How can I generate a histogram and samples from this?
xseq<-seq(0, 100, 1)
n<-100
Z<- pnorm(xseq,0,1)
U<- pbern(xseq, 0.4, lower.tail = TRUE, log.p = FALSE)
Beta <- (-1)^U*(4*log(n)/(sqrt(n)) + abs(Z))
Some demonstrations of tools that will be of use:
rnorm(1) # generates one standard normal variable
rnorm(10) # generates 10 standard normal variables
rnorm(1, 5, 6) # generates 1 normal variable with mu = 5, sigma = 6
# not needed for this problem, but perhaps worth saying anyway
rbinom(5, 1, 0.4) # generates 5 Bernoulli variables that are 1 w/ prob. 0.4
So, to generate one instance of a beta:
n <- 100 # using the value you gave; I have no idea what n means here
u <- rbinom(1, 1, 0.4) # make one Bernoulli variable
z <- rnorm(1) # make one standard normal variable
beta <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
But now, you'd like to do this many times for a Monte Carlo simulation. One way you might do this is by building a function, having beta be its output, and using the replicate() function, like this:
n <- 100 # putting this here because I assume it doesn't change
genbeta <- function(){ # output of this function will be one copy of beta
  u <- rbinom(1, 1, 0.4)
  z <- rnorm(1)
  return((-1)^u * (4 * log(n) / sqrt(n) + abs(z)))
}
# note that we don't need to store beta anywhere directly;
# rather, it is just the return()ed value of the function we defined
betadraws <- replicate(5000, genbeta())
hist(betadraws)
This will have the effect of making 5000 copies of your beta variable and putting them in a histogram.
There are other ways to do this -- for instance, one might just make a big matrix of the random variables and work directly with it -- but I thought this would be the clearest approach for starting out.
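For completeness, a vectorised sketch of that idea (my own addition; it draws all the random variables at once and produces the same kind of betadraws vector without an explicit function):
n <- 100
u <- rbinom(5000, 1, 0.4)   # 5000 Bernoulli variables in one call
z <- rnorm(5000)            # 5000 standard normals in one call
betadraws <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
hist(betadraws)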
EDIT: I realized that I ignored the second equation entirely, which you probably didn't want.
We've now made a vector of beta values, and you can control the length of the vector in the first parameter of the replicate() function above. I'll leave it as 5000 in my continued example below.
To get random samples of the Y vector, you could use something like:
x <- replicate(5000, rnorm(17))
# makes a 17 x 5000 matrix of independent standard normal variables
epsilon <- rnorm(17)
# vector of 17 standard normals
y <- x %*% betadraws + epsilon
# y is now a 17 x 1 matrix (morally equivalent to a vector of length 17)
and if you wanted to get many of these, you could wrap that inside another function and replicate() it.
Alternatively, if you didn't want the Y vector, but just a single Y_i component:
x <- rnorm(5000)
# x is a vector of 5000 iid standard normal variables
epsilon <- rnorm(1)
# epsilon_i is a single standard normal variable
y <- t(x) %*% betadraws + epsilon
# t() is the transpose function; y is now a 1 x 1 matrix

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
The values I get for var2 range between -0.5 and 0.5 with a mean of 0. I would like the data to be much more spread out, so that I could simply transform it (e.g. by adding 50) and get a range quite similar to that of my first variable.
Does anyone know a way to generate this kind of, more or less, meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First, convert the values in A to have the new desired mean of 50,000 and an inverse relationship (i.e. subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
On its own this gives r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside the range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
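If some values did fall outside the range, one simple option (just a sketch) would be to drop them:
B_kept <- B[B >= 0 & B <= 1e5]   # keep only values inside 0..100,000
length(B) - length(B_kept)       # number of values dropped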
If you're happy with the correlation and the marginal distribution (i.e. the shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
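Putting that together with the var2 generated above (a sketch; the scale factors are just the ones suggested in this answer): a linear transformation changes the range but leaves the correlation untouched.
var2_scaled <- var2 * 100000 + 50000   # maps roughly (-0.5, 0.5) onto (0, 100000)
cor(var1, var2_scaled)                 # unchanged, still about -0.78
range(var2_scaled)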

Generate Random Numbers with Std Dev x and Fixed Product

I want to generate a series of returns x such that the standard deviation of the returns is, say, 0.03 and the product of 1+x is 1. To summarise, there are two conditions on the returns:
1) sd(x) == 0.03
2) prod(1+x) == 1
Is this possible and if so, how can I implement it in R?
Thank you.
A slightly more sophisticated approach is to use knowledge of the log-normal distribution: from ?dlnorm, Var = exp(2*mu + sigma^2)*(exp(sigma^2) - 1). We want the geometric mean to equal 1, so the mean on the log scale should be 0. That leaves Var = exp(sigma^2)*(exp(sigma^2) - 1); we can't obviously solve this analytically, but we can use uniroot:
Find the correct log-variance:
vfun <- function(s2,v=0.03^2) { exp(s2)*(exp(s2)-1)-v }
s2 <- uniroot(vfun,interval=c(1e-6,100))$root
Generate values:
set.seed(1001)
x <- rnorm(1000,mean=0,sd=sqrt(s2))
x <- exp(x-mean(x))-1 ## centring on the log scale makes prod(1+x) exactly 1
prod(1+x) ## exactly 1
sd(x)
This produces values with a standard deviation that is not exactly 0.03, but close. If we wanted, we could fix this too ...
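One possible way to pin the standard deviation down exactly (a sketch of my own, reusing s2 from above): rescale the log-returns by a factor found with uniroot; recentring them keeps the product at exactly 1 regardless of the scale factor.
set.seed(1001)
x0 <- rnorm(1000, mean = 0, sd = sqrt(s2))           # raw log-returns
f  <- function(k) sd(exp(k * (x0 - mean(x0))) - 1) - 0.03
k  <- uniroot(f, interval = c(0.1, 10))$root         # scale factor that hits sd = 0.03
x  <- exp(k * (x0 - mean(x0))) - 1
sd(x)        # 0.03 (up to uniroot's tolerance)
prod(1 + x)  # still exactly 1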
A very simple approach is to simply simulate returns until you have a set that satisfies your requirements. You will need to specify a tolerance for your requirements, though (see here why).
nn <- 10
epsilon <- 1e-3
while ( TRUE ) {
  xx <- rnorm(nn,0,0.03)
  if ( abs(sd(xx)-0.03)<epsilon & abs(prod(1+xx)-1)<epsilon ) break
}
xx
yields
[1] 0.007862226 -0.011437600 -0.038740969 0.028614022 0.006986953
[6] -0.004131429 0.030846398 -0.037977057 0.046448318 -0.025294236
