R - random distribution with predefined min, max, mean, and sd values

I want to generate a random distribution of, say, 10,000 numbers with predefined min, max, mean, and sd values. I followed the linked question on setting upper and lower limits in rnorm to get a random distribution with fixed min and max values. However, in doing so, the mean value changes.
For example,
#Function to generate values between a lower limit and an upper limit.
mysamp <- function(n, m, s, lwr, upr, nnorm) {
set.seed(1)
samp <- rnorm(nnorm, m, s)
samp <- samp[samp >= lwr & samp <= upr]
if (length(samp) >= n) {
return(sample(samp, n))
}
stop(simpleError("Not enough values to sample from. Try increasing nnorm."))
}
Account_Value <- mysamp(n=10000, m=1250000, s=4500000, lwr=50000, upr=5000000, nnorm=1000000)
summary(Account_Value)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 50060 1231000 2334000 2410000 3582000 5000000
#Note - though min and max values are good, mean value is very skewed for an obvious reason.
# sd(Account_Value) # 1397349
I am not sure whether we can generate a random normal distribution that meets all conditions. If there is any other sort of random distribution that can meet all conditions, please do share too.
Look forward to your inputs.
-Thank you.

You could use a generalized form of the beta distribution, known as the Pearson type I distribution. The standard beta distribution is defined on the interval (0,1), but you can take a linear transformation of a standard beta distributed variable to obtain values between any (min, max). The answer to this question on CrossValidated explains how to parameterize a beta distribution with its mean and variance, with certain constraints.
While it's possible to formulate both a truncated normal and a generalized beta distribution with the desired min, max, mean and sd, the shapes of the two distributions will be very different. This is because the truncated normal distribution has a positive probability density at the endpoints of its support interval, while the generalized beta density falls to zero at the endpoints whenever both of its shape parameters are greater than 1. Which shape is preferable will depend on your intended application.
Here's an implementation in R for generating generalized beta distributed observations with a mean, variance, min and max parameterization.
rgbeta <- function(n, mean, var, min = 0, max = 1)
{
dmin <- mean - min
dmax <- max - mean
if (dmin <= 0 || dmax <= 0)
{
stop(paste("mean must be between min =", min, "and max =", max))
}
if (var >= dmin * dmax)
{
stop(paste("var must be less than (mean - min) * (max - mean) =", dmin * dmax))
}
# mean and variance of the standard beta distributed variable
mx <- (mean - min) / (max - min)
vx <- var / (max - min)^2
# find the corresponding alpha-beta parameterization
a <- ((1 - mx) / vx - 1 / mx) * mx^2
b <- a * (1 / mx - 1)
# generate standard beta observations and transform
x <- rbeta(n, a, b)
y <- (max - min) * x + min
return(y)
}
set.seed(1)
n <- 10000
y <- rgbeta(n, mean = 1, var = 4, min = -4, max = 5)
sapply(list(mean, sd, min, max), function(f) f(y))
# [1] 0.9921269 2.0154131 -3.8653859 4.9838290
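As a sanity check on the parameterization above, the following sketch recomputes the shape parameters for the example call and maps them back to a mean and variance on the (min, max) scale. The helper name gbeta_shapes is hypothetical and not part of the answer's code; it only repeats the same algebra without drawing random numbers.

```r
# Hypothetical helper: shape parameters of the standard beta that, once
# rescaled to (min, max), has the requested mean and variance.
gbeta_shapes <- function(mean, var, min = 0, max = 1) {
  mx <- (mean - min) / (max - min)   # mean on the (0, 1) scale
  vx <- var / (max - min)^2          # variance on the (0, 1) scale
  a <- mx * (mx * (1 - mx) / vx - 1)
  b <- a * (1 / mx - 1)
  c(a = a, b = b)
}

p <- gbeta_shapes(mean = 1, var = 4, min = -4, max = 5)
a <- p[["a"]]
b <- p[["b"]]

# Map the theoretical beta moments back to the (-4, 5) scale.
back_mean <- -4 + 9 * a / (a + b)
back_var  <- 81 * a * b / ((a + b)^2 * (a + b + 1))
print(c(a = a, b = b, mean = back_mean, var = back_var))
```

Both shape parameters come out greater than 1 for this example, which is the case in which the density drops to zero at both endpoints.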

Discussion:
Hi. This is a very interesting problem. It takes some effort to solve properly, and a solution cannot always be found.
The first thing is that when you truncate a distribution (set a min and max for it), the standard deviation is limited: it has a maximum that depends on the min and max values. If you ask for a value that is too large, you cannot get it.
The second restriction limits the mean. Obviously a mean below the minimum or above the maximum will not work, but even a mean that is too close to one of the limits may be impossible to satisfy.
The third restriction limits combinations of these parameters. I'm not sure exactly how it works, but I am pretty sure that not all combinations can be satisfied.
But there are some combinations that do work and can be found.
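One such limit can be written down exactly. For any distribution confined to [min, max] with a given mean, the variance cannot exceed (mean - min) * (max - mean) (the Bhatia-Davis inequality), so a requested sd above the square root of that product is infeasible for every distribution family. The sketch below applies this to the numbers in the question; max_sd is a hypothetical helper name. This is only a necessary condition: a truncated normal may still be unable to reach values that pass it.

```r
# Hypothetical helper: largest sd any distribution on [min, max] can have
# when its mean is fixed (Bhatia-Davis bound).
max_sd <- function(mean, min, max) {
  stopifnot(min < max, mean >= min, mean <= max)
  sqrt((mean - min) * (max - mean))
}

# Values from the question: the requested sd is 4,500,000.
bound <- max_sd(mean = 1250000, min = 50000, max = 5000000)
print(bound)            # about 2.12 million
print(4500000 <= bound) # FALSE: the requested sd cannot be reached
```

So for the question's parameters no distribution on that interval, truncated normal or otherwise, can have the requested sd; either the sd or the bounds have to be relaxed.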
Solution:
The problem is: what parameters mean and sd should the truncated (cut) distribution with limits a and b be given, so that in the end its mean equals desired_mean and its standard deviation equals desired_sd?
It is important that the parameters mean and sd apply before truncation. That is why the mean and standard deviation of the final sample are different.
Below is code that solves the problem using the function optim(). It may not be the best solution for this problem, but it generally works:
require(truncnorm)
eval_function <- function(mean_sd){
mean <- mean_sd[1]
sd <- mean_sd[2]
sample <- rtruncnorm(n = n, a = a, b = b, mean = mean, sd = sd)
mean_diff <-abs((desired_mean - mean(sample))/desired_mean)
sd_diff <- abs((desired_sd - sd(sample))/desired_sd)
mean_diff + sd_diff
}
n = 1000
a <- 1
b <- 6
desired_mean <- 3
desired_sd <- 1
set.seed(1)
o <- optim(c(desired_mean, desired_sd), eval_function)
new_n <- 10000
your_sample <- rtruncnorm(n = new_n, a = a, b = b, mean = o$par[1], sd = o$par[2])
mean(your_sample)
sd(your_sample)
min(your_sample)
max(your_sample)
eval_function(c(o$par[1], o$par[2]))
I am very interested if there are other solutions to that problem, so please post them if you find other answers.
EDIT:
@Mikko Marttila: Thanks to your comment and the Wikipedia link, I implemented the formulas for the mean and sd of a truncated normal distribution. Now the solution is WAY more elegant, and it should calculate the mean and sd of the desired distribution quite accurately, if they exist. It is also much faster.
I implemented eval_function2 which should be used in the optim() function instead of previous one:
eval_function2 <- function(mean_sd){
mean <- mean_sd[1]
sd <- mean_sd[2]
# standardized truncation bounds
alpha <- (a - mean)/sd
betta <- (b - mean)/sd
# closed-form mean of the truncated normal
trunc_mean <- mean + sd * (dnorm(alpha, 0, 1) - dnorm(betta, 0, 1)) /
(pnorm(betta, 0, 1) - pnorm(alpha, 0, 1))
# closed-form variance; note that the last ratio is squared
trunc_var <- (sd ^ 2) *
(1 +
(alpha * dnorm(alpha, 0, 1) - betta * dnorm(betta, 0, 1))/
(pnorm(betta, 0, 1) - pnorm(alpha, 0, 1)) -
((dnorm(alpha, 0, 1) - dnorm(betta, 0, 1))/
(pnorm(betta, 0, 1) - pnorm(alpha, 0, 1))) ^ 2)
trunc_sd <- trunc_var ^ 0.5
mean_diff <- abs((desired_mean - trunc_mean)/desired_mean)
sd_diff <- abs((desired_sd - trunc_sd)/desired_sd)
# return the combined relative error for optim() to minimize
mean_diff + sd_diff
}
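For reference, here is a self-contained sketch of the closed-form approach in base R only, using the same example values as above (limits 1 and 6, desired mean 3, desired sd 1). The names trunc_moments and objective are hypothetical. Two choices here are assumptions rather than part of the answer: the search runs over log(sigma) to keep sigma positive, and the error is squared rather than taken as an absolute value, which gives optim() a smoother surface.

```r
# Closed-form mean and sd of a normal(mu, sigma) truncated to [a, b].
trunc_moments <- function(mu, sigma, a, b) {
  alpha <- (a - mu) / sigma
  beta  <- (b - mu) / sigma
  z     <- pnorm(beta) - pnorm(alpha)          # mass of the parent inside [a, b]
  ratio <- (dnorm(alpha) - dnorm(beta)) / z
  m <- mu + sigma * ratio
  v <- sigma^2 * (1 + (alpha * dnorm(alpha) - beta * dnorm(beta)) / z - ratio^2)
  c(mean = m, sd = sqrt(v))
}

# Squared relative error; par = c(mu, log(sigma)).
objective <- function(par) {
  mom <- trunc_moments(par[1], exp(par[2]), a = 1, b = 6)
  ((mom[["mean"]] - 3) / 3)^2 + ((mom[["sd"]] - 1) / 1)^2
}

o <- optim(c(3, log(1)), objective)
fit <- trunc_moments(o$par[1], exp(o$par[2]), a = 1, b = 6)
print(c(mu = o$par[1], sigma = exp(o$par[2])))  # parameters before truncation
print(fit)                                      # moments after truncation
```

The parameters printed first are the ones to pass to a truncated normal sampler; the second line shows how close the resulting truncated moments are to the targets.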

Related

Normal distribution in R (what values for mean and sd?)

I have to make a normal distribution from a set of pre-established data, henceforth xvec. So, I know I need to use dnorm(xvec, meanvec, sdvec). But what values should I use for mean and sd? Can I always use meanvec = mean(xvec) and sdvec = sd(xvec)? Is that a reasonable approach? Or is it preferable to keep the default values of mean = 0 and sd = 1?
I'm asking because in the examples I looked at, the values for mean and sd were always chosen beforehand. For example, this one, from https://www.tutorialspoint.com/r/r_normal_distribution.htm:
Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)
Why did the author use mean = 2.5 and sd = 0.5, given that
> mean(x)
[1] 5.105265e-16
> sd(x)
[1] 5.816786?

Generate a matrix with certain values such that its standard deviation is 1?

I'm currently going through an 'Introduction to R' book and I am completely stuck at the following question:
Create a 5x5 matrix (M), all its entries drawn from the uniform distribution, with sd 1 and mean being the column number of the element. (so mean(matrix[,i]) == column(i), sd(matrix) == 1)
I have to make use of the sapply() function.
I was thinking about something like this:
m <- matrix(runif(25), nrow = 5, ncol = 50
sapply(matrix, function(x) sd(x) == 1)
But that part already doesn't work and I'm just stuck.
Help would be appreciated!
The mean can be set by the following:
my_uniform <- function(col_nbr) {
runif(5, min = col_nbr-sqrt(12)/2, max=col_nbr+sqrt(12)/2)
}
M <- sapply(1:5, my_uniform)
This gives a theoretical sd of 1 for each column, and the theoretical mean of each column is its column number. For a uniform distribution on [a, b], the formula for the mean is: (a + b) / 2
The formula for the sd is: (b - a) / sqrt(12)
With runif() one can only simulate values within a range, each with the same probability; the expected mean as n goes to infinity is the midpoint between the min and the max.
runif() takes no mean or standard deviation arguments directly. What you can do is choose the range so that its midpoint (i.e. the mean) is the number you expect; with a range of width 1, as below, the standard deviation is 1/sqrt(12), about 0.29, not 1:
set.seed(1)
numrow<-5
numcol<-5
Mat<-matrix(NA, nrow = numrow, ncol = numcol)
for(i in 1:numcol){
Mat[,i]<- runif(numrow, min = i-0.5, max = i+0.5)
}
Mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.7655087 2.398390 2.705975 3.997699 5.434705
# [2,] 0.8721239 2.444675 2.676557 4.217619 4.712143
# [3,] 1.0728534 2.160798 3.187023 4.491906 5.151674
# [4,] 1.4082078 2.129114 2.884104 3.880035 4.625555
# [5,] 0.7016819 1.561786 3.269841 4.277445 4.767221
To see the formulas of the expected mean and expected variance (therefore the standard deviation) I refer to https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)
This should now be the correct way to define the uniform distribution. If the mean is defined as mean=0.5*(a+b) then defining the upper limit like this will result in a mean of the column number.
sapply(1:5, function(x){runif(5, min = 0, max = x*2)})
See this little MonteCarlo experiment:
mean(runif(50000, min = 0, max = 1*2))
You have to find the pdf range (a, b) that fits each (mean, sd) pair first. The mean of a uniform distribution is
mu <- (b + a) / 2
The mu values are indexed from 1:5.
The sd of a uniform distribution is (b - a) / sqrt(12)
The sd is fixed at 1, so use the sd equation to solve for b: b = a + sqrt(12).
Then plug b into the mu equation to solve for a: a = mu - sqrt(12)/2, and therefore b = mu + sqrt(12)/2.
Now you have the a, b parameters of the uniform distribution.
The sapply call then looks like this:
z <- sapply(1:5, function(x) runif(5, x - sqrt(12)/2, x + sqrt(12)/2))
Running summary(z) will give you the output stats. Because of the small sample size the sample means will be off. To test, change the runif sample size from 5 to 100000 and run summary(z) again. You will see that the column means converge to the index values.
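The convergence claim can be checked for both moments at once. This is a minimal sketch assuming the range mu - sqrt(12)/2 to mu + sqrt(12)/2 that follows from the two uniform formulas; the seed and the variable name half_width are arbitrary choices for illustration.

```r
set.seed(42)
half_width <- sqrt(12) / 2   # half of the range needed for sd = 1

# One column per target mean 1..5, with a large sample so the
# sample statistics sit close to the theoretical values.
z <- sapply(1:5, function(mu) runif(100000, mu - half_width, mu + half_width))

col_means <- colMeans(z)
col_sds   <- apply(z, 2, sd)
print(round(col_means, 2))
print(round(col_sds, 2))
```

With only 5 draws per column, as the exercise asks, the same code returns sample means and sds that can differ noticeably from the column number and from 1; only the theoretical values are fixed.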

Plotting a skewed normal distribution in R

How can I plot a skewed normal distribution in R, given the number of cases, the mean, standard deviation, median and the MAD?
An example would be that I have 1'196 cases, where the mean cost is 6'389, the standard deviation 5'158, the median 4'930 and the MAD 1'366. And we know that a billed case always costs something, so the cost must always be positive.
The best answer to this problem I could find is from https://math.stackexchange.com/a/17995/54064, which recommends using the sn package. However, I could not figure out how to use it for my concrete use case.
I've had some success with fGarch package.
require("fGarch")
hist(rsnorm(1000, mean = 0, sd = 1, xi = 15))
mmm <- replicate(300, {
x <- rsnorm(1196, mean = 6389, sd = 5158, xi = 15)
c(mean = mean(x), sd = sd(x))
})
> mean(mmm[1, ])
[1] 6404.312
> mean(mmm[2, ])
[1] 5169.572

Generating normal distribution data within range 0 and 1

I am working on a project about income distribution, and I would like to generate random data to test the theory. Let's say I have N=5 countries, each with a population of n=1000, and I want to generate a random income (NORMAL DISTRIBUTION) for each person in each population, with the constraints that income is between 0 and 1 and that all countries have the same mean but DIFFERENT standard deviations. I used the function rnorm(n, meanx, sd) to do it. I know that the UNIFORM DISTRIBUTION function runif(n, min, max) has arguments for setting min and max, but rnorm does not. Since rnorm doesn't provide arguments for setting min and max values, I have to write a piece of code that checks whether the set of random data satisfies my [0,1] constraint.
I successfully generated income data for n=100. However, if I increase n to a multiple of 100, e.g. n=200, 300, ..., 1000, my program hangs. I can see why: it just generates data randomly without min and max constraints, so with a larger n the probability that a whole sample passes the check is lower than with n=100. The loop just keeps running: generate data, fail the check.
To fix this, I am thinking of breaking n=1000 into small batches, let's say b=100. Since rnorm successfully generates 100 samples in the range [0,1] and they are NORMALLY DISTRIBUTED, it should work well if I run the loop 10 times, once for each batch of 100 samples. Then I will combine the 10 * 100 samples into one data set of 1000 for my later analysis.
However, mathematically speaking, I am NOT SURE whether the NORMAL DISTRIBUTION constraint for n=1000 is still satisfied when doing it this way. I have attached my code here. Hopefully my explanation is clear. All of your opinions will be very useful for my work. Thanks a lot.
# Update:
# plot histogram
# create the random data with same mean, different standard deviation and x in range [0,1]
# Generate the output file
# Generate data for K countries
#---------------------------------------------
# Configurable variables
number_of_populations = 5
n=100 #number of residents (*** input a number which is k times 100)
meanx = 0.7
sd_constant = 0.1 # sd = sd_constant + j/50
min=0 #min income
max=1 #max income
#---------------------------------------------
batch =100 # divide the large number of residents into small batch of 100
x= matrix(
0, # the data elements
nrow=n, # number of rows
ncol=number_of_populations, # number of columns
byrow = TRUE) # fill matrix by rows
x_temp = rep(0,n)
# generate income data randomly for each country
for (j in 1:number_of_populations){
# 1. Generate uniform distribution
#x[,j] <- runif(n,min, max)
# 2. Generate Normal distribution
sd = sd_constant+j/50
repeat
{
{
x_temp <- rnorm(n, meanx, sd)
is_inside = TRUE
for (i in 1:n){
if (x_temp[i]<min || x_temp[i] >max) {
is_inside = FALSE
break
}
}
}
if(is_inside==TRUE) {break}
} #end repeat
x[,j] <- x_temp
}
# write in csv
# each column stores different income of its residents
working_dir= "D:\\dataset\\"
setwd(working_dir)
file_output = "random_income.csv"
sink(file_output)
write.table(x,file=file_output,sep=",", col.names = F, row.names = F)
sink()
file.show(file_output) #show the file in directory
#plot histogram of x for each population
#par(mfrow=c(3,3), oma=c(0,0,0,0,0))
attach(mtcars)
par(mfrow=c(1,5))
for (j in 1:number_of_populations)
{
#plot(X[,i],y,'xlab'=i)
hist(x[,j],main="Normal",'xlab'=j)
}
Here's a sensible simple way...
sampnorm01 <- function(n) qnorm(runif(n,min=pnorm(0),max=pnorm(1)))
Test it out:
mysamp <- sampnorm01(1e5)
hist(mysamp)
Thanks to @PatrickPerry, here is a generalized truncated normal, again using the inverse CDF method. It allows for different parameters on the normal and different truncation bounds.
rtnorm <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
bounds <- pnorm(c(min, max), mean, sd)
u <- runif(n, bounds[1], bounds[2])
qnorm(u, mean, sd)
}
Test it out:
mysamp <- rtnorm(1e5, .7, .2)
hist(mysamp)
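For comparison with the loop in the question, which throws away the whole vector whenever a single value falls outside the range, here is a minimal sketch of element-wise rejection: it keeps every draw that lands inside the bounds and only redraws the rest. It targets the same truncated normal distribution as the inverse CDF method. The name rtnorm_reject is hypothetical, and the approach assumes the bounds capture a reasonable share of the parent normal; otherwise the loop will need many passes.

```r
# Hypothetical element-wise rejection sampler for a truncated normal.
rtnorm_reject <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(n, mean, sd)
    # keep only the draws inside [min, max]; the rest are discarded
    out <- c(out, draw[draw >= min & draw <= max])
  }
  out[seq_len(n)]
}

set.seed(1)
x <- rtnorm_reject(1000, mean = 0.7, sd = 0.2)
print(range(x))
print(c(mean = mean(x), sd = sd(x)))
```

Note that the sample mean and sd describe the truncated distribution, so they will generally differ from the mean and sd passed to rnorm.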
You can normalize the data:
x = rnorm(100)
# normalize
min.x = min(x)
max.x = max(x)
x.norm = (x - min.x)/(max.x - min.x)
print(x.norm)
Here is my take on it.
The data is first normalized (at which stage the standard deviation is lost). After that, it is fitted to the range specified by the lower and upper parameters.
#' Creates a random normal distribution within the specified bounds
#'
#' WARNING: This function does not preserve the standard deviation
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean = 0, sd = 1, lower = -1, upper = 1){
mean = ifelse(test = (is.na(mean)|| (mean < lower) || (mean > upper)),
yes = mean(c(lower, upper)),
no = mean)
data <- rnorm(n, mean = mean, sd = sd) # data
if (!is.na(lower) && !is.na(upper)){ # adjust data to specified range
drange <- range(data) # data range
irange <- range(lower, upper) # input range
data <- (data - drange[1]) / (drange[2] - drange[1]) # normalize data (make it 0 to 1)
data <- (data * (irange[2] - irange[1])) + irange[1] # adjust to specified range
}
return(data)
}
Example:
a <- rtnorm(n = 1000, lower = 10, upper = 90)
range(a)
plot(hist(a, 50))

Generating samples from a two-Gaussian mixture in r (code given in MATLAB)

I'm trying to create (in r) the equivalent to the following MATLAB function that will generate n samples from a mixture of N(m1,(s1)^2) and N(m2, (s2)^2) with a fraction, alpha, from the first Gaussian.
I have a start, but the results are notably different between MATLAB and R (i.e., the MATLAB results give occasional values of +-8 but the R version never even gives a value of +-5). Please help me sort out what is wrong here. Thanks :-)
For Example:
Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.
MATLAB
function
function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha)
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);
implementation
P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])
resulting plot
resulting hist
R
yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))
resulting plot
resulting hist
As always, THANK YOU!
SOLUTION
gaussmix <- function(nsim,mean_1,mean_2,std_1,std_2,alpha){
U <- runif(nsim)
I <- as.numeric(U<alpha)
y <- I*rnorm(nsim,mean=mean_1,sd=std_1)+
(1-I)*rnorm(nsim,mean=mean_2,sd=std_2)
return(y)
}
z1 <- gaussmix(1000,0,0,1,6,0.95)
z1_standardized <- (z1-mean(z1))/sqrt(var(z1))
z2 <- gaussmix(1000,0,3,1,1,0.80)
z2_standardized <- (z2-mean(z2))/sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3-mean(z3))/sqrt(var(z3))
par(mfrow=c(2,3))
hist(z1_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of 95% of N(0,1) and 5% of N(0,36)",
col="blue",xlab=" ")
hist(z2_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of 80% of N(0,1) and 20% of N(3,1)",
col="blue",xlab=" ")
hist(z3_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of samples of LN(0,1)",col="blue",xlab=" ")
##
plot(z1_standardized,type='l',
main="1000 samples from a mixture N(0,1) and N(0,36)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z2_standardized,type='l',
main="1000 samples from a mixture N(0,1) and N(3,1)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z3_standardized,type='l',
main="1000 samples from LN(0,1)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
There are two problems, I think ... (1) your R code is creating a mixture of normal distributions with standard deviations of 1 and 37. (2) By setting prob equal to alpha in your rbinom() call, you're getting a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of Gaussian with sd 1, rather than a Gaussian with sd 1 that is contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (which is about 36) basically reduces it to a standard Gaussian with a slight bump near the origin ...
(The other answers posted here do solve your problem perfectly well, but I thought you might be interested in a diagnosis ...)
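The size of the difference can be worked out without simulation. For a mixture of zero-mean Gaussians the variance is the weighted sum of the component variances, so the sketch below compares the intended mixture with the one the question's R code actually draws from. The helper name mix_sd is hypothetical.

```r
# Hypothetical helper: sd of a mixture of zero-mean normals,
# given mixing weights and component sds.
mix_sd <- function(weights, sds) {
  stopifnot(length(weights) == length(sds), abs(sum(weights) - 1) < 1e-12)
  sqrt(sum(weights * sds^2))
}

intended <- mix_sd(c(0.95, 0.05), c(1, 6))    # 95% sd 1, 5% sd 6: about 1.66
actual   <- mix_sd(c(0.05, 0.95), c(1, 37))   # 5% sd 1, 95% sd 37: about 36.1
print(c(intended = intended, actual = actual))
```

After standardizing, the wide component of the intended mixture still reaches several standard deviations out (6 / 1.66 is roughly 3.6 per component sd), whereas in the accidental mixture nearly all the mass is the wide component itself, so the standardized values look like an ordinary standard normal.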
A more compact (and perhaps more idiomatic) version of your Matlab gaussmix function (I think runif(n)<alpha is slightly more efficient than rbinom(n,size=1,prob=alpha) )
gaussmix <- function(n,m1,m2,s1,s2,alpha) {
I <- runif(n)<alpha
rnorm(n,mean=ifelse(I,m1,m2),sd=ifelse(I,s1,s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and diverse covariance structures. See ?mclust::sim. The example task would be done this way:
require(mclust)
simdata = sim(modelName = "V",
parameters = list(pro = c(0.95, 0.05),
mean = c(0, 0),
variance = list(modelName = "V",
d = 1,
G = 2,
sigmasq = c(1, 36))),
n = 1000)
plot(scale(simdata[,2]), type = "h")
I recently wrote the density and sampling function of a multinomial mixture of normal distributions:
dmultiNorm <- function(x,means,sds,weights)
{
if (length(means)!=length(sds)) stop("Length of means must be equal to length of standard deviations")
N <- length(x)
n <- length(means)
if (missing(weights))
{
weights <- rep(1,n)
}
if (length(weights)!=n) stop ("Length of weights not equal to length of means and sds")
weights <- weights/sum(weights)
dens <- numeric(N)
for (i in 1:n)
{
dens <- dens + weights[i] * dnorm(x,means[i],sds[i])
}
return(dens)
}
rmultiNorm <- function(N,means,sds,weights,scale=TRUE)
{
if (length(means)!=length(sds)) stop("Length of means must be equal to length of standard deviations")
n <- length(means)
if (missing(weights))
{
weights <- rep(1,n)
}
if (length(weights)!=n) stop ("Length of weights not equal to length of means and sds")
Res <- numeric(N)
for (i in 1:N)
{
s <- sample(1:n,1,prob=weights)
Res[i] <- rnorm(1,means[s],sds[s])
}
return(Res)
}
Here means is a vector of means, sds is a vector of standard deviations, and weights is a vector of proportional probabilities for sampling from each of the distributions. Is this useful to you?
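The sampling loop can also be written without iterating over observations. The following is a minimal sketch of a vectorized variant, not the author's function: rmix_vec is a hypothetical name, and it relies on sample() accepting unnormalized weights through its prob argument and on rnorm() accepting vectors for mean and sd.

```r
# Hypothetical vectorized sampler for a mixture of normals.
rmix_vec <- function(N, means, sds, weights = rep(1, length(means))) {
  stopifnot(length(means) == length(sds), length(weights) == length(means))
  # draw the component index for every observation in one call
  comp <- sample(seq_along(means), N, replace = TRUE, prob = weights)
  # rnorm() recycles nothing here: means[comp] and sds[comp] have length N
  rnorm(N, mean = means[comp], sd = sds[comp])
}

set.seed(1)
x <- rmix_vec(1000, means = c(0, 0), sds = c(1, 6), weights = c(0.95, 0.05))
x_std <- as.numeric(scale(x))   # mean zero, standard deviation one
print(summary(x_std))
```

This draws each observation's component at random, so the number of observations from the wide component varies from run to run around 5% of N.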
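Here is code to do this task (note that the third argument of rnorm() is a standard deviation, so rnorm(50, 0, 36) draws with sd 36; use 6 instead if N(0,36) denotes a variance of 36, as in the MATLAB call):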
"For Example: Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one."
plot(multG <- c( rnorm(950), rnorm(50, 0, 36))[sample(1000)] , type="h")
scmulG <- scale(multG)
summary(scmulG)
#-----------
V1
Min. :-9.01845
1st Qu.:-0.06544
Median : 0.03841
Mean : 0.00000
3rd Qu.: 0.13940
Max. :12.33107

Resources