How to generate normally distributed random numbers in specific interval? - r

I want to generate 100 normally distributed random numbers in the interval [-50, 50]. However, the code below does not restrict the generated numbers to [-50, 50].
n <- rnorm(100, -50,50)
plot(n)

Your question is strangely asked, because it seems you don't fully understand the rnorm function.
rnorm(100, -50,50)
generates a sample of 100 points from a normal distribution centered on -50 with a standard deviation of 50. So you need to specify what you want:
"100 normally distributed random numbers in interval [-50,50]": a normal distribution has no upper or lower limit; the probability of drawing any value is never 0, it just becomes very low several standard deviations away from the mean. So:
Either you want a normal distribution centered on 0 with standard deviation 50, in which case the answer is rnorm(100, 0, 50), but you will get values above 50 and below -50.
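For instance (a quick sketch, not part of the original answer), you can see this directly:
y <- rnorm(100, 0, 50)
sum(y < -50 | y > 50) # typically around 30 of the 100 values, since P(|X| > 50) is about 0.32
range(y)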
Or you actually want a normal distribution with no values outside the [-50,50] range; in this case you still need to give a standard deviation, and you will need to cut off the values drawn outside the range. You could do something like:
sd <- 50
n <- data.frame(draw = rnorm(1000, 0,sd))
final <- sample(n$draw[!with(n, draw > 50 | draw < -50)],100)
Here is an example of what it does for two different values of sd:
sd <- 10
n1 <- data.frame(draw = rnorm(1000, 0, sd))
final1 <- sample(n1$draw[!with(n1, draw > 50 | draw < -50)], 100)
sd <- 50
n2 <- data.frame(draw = rnorm(1000, 0, sd))
final2 <- sample(n2$draw[!with(n2, draw > 50 | draw < -50)], 100)
par(mfrow = c(1,2))
hist(final1,main = "sd = 10")
hist(final2,main = "sd = 50")
Or you just want to sample values in this range with a flat distribution; in this case, just use sample(-50:50, 100, replace = TRUE).

You have to make a sacrifice. Either your random variable is not normally distributed because the tails are cut off, or you compromise on the boundaries. You can define your random variable to "practically" lie in a range, that is, you accept that a very small percentage lies outside. Maybe 1 % would be an acceptable choice for your purpose.
my_range <- setNames(c(-50, 50), c("lower", "upper"))
prob <- 0.01 # probability to lie outside of my_range
# you have to define this, 1 % in this case
my <- mean(my_range)
z_value <- qnorm(prob/2)
sigma <- (my - my_range["lower"]) / (-1 * z_value)
# proof
N <- 100000 # large number
sim_vec <- rnorm(N, my, sigma)
chk <- 1 - length(sim_vec[sim_vec >= my_range["lower"] &
                            sim_vec <= my_range["upper"]]) / length(sim_vec)
cat("simulated proportion outside range:", chk, "\n")

Related

Unable to find outside of range value using R tool

1. Generate 500 random numbers between 0 and 100.
2. Find the sum of these 500 random numbers.
3. Repeat steps 1 and 2 above 1000 times, generating a new set of random numbers each time.
4. Letting Y denote the sum of the 500 numbers, obtain a box-whisker plot of the random variable Y.
5. Display the values of Y which are outside mean +/- 2*SD, where SD is the standard deviation.
6. Which statistical distribution is justified for the random variable Y?
For this I tried:
y <- runif(500, min = 1, max = 100) # 1
sum(y) # 2
c <- runif(1000, min = 1, max = 100) # 3
sum(c) # 4
This is what I managed to figure out, but I am not sure whether it is correct or not.
Please help me out.
This seems to be a homework task, but let me try to point you in the right direction.
Steps 1-3 create the sums of random variables. Since no distribution is given, we assume a uniform distribution.
Y <- numeric(0) # sums are stored here
for (i in 1:1000) {
  Y[i] <- sum(runif(500, min=0, max=100))
}
So Y contains 1000 sums of 500 uniformly distributed random variables.
There is another way to create this Y:
Y <- sapply(1:1000, function(x) sum(runif(500, min=0, max=100)))
For steps 4 to 6, I suggest you take a look at the R help for box plots (steps 4 and 5) and histograms (step 6). Try ?boxplot and ?hist.
Y <- replicate(1000, sum(runif(500, min=0, max=100)))
min_val = mean(Y) - 2*sd(Y)
max_val = mean(Y) + 2*sd(Y)
Y_min <- Y[Y < min_val]
Y_max <- Y[Y > max_val]
boxplot(Y, range=1)
points(rep(1,length(Y_min)), Y_min, pch=23, col="red")
points(rep(1,length(Y_max)), Y_max, pch=23, col="blue")
You get an answer for step 6 if you understand the mathematics. Perhaps the central limit theorem gives you some insight.
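As a hedged illustration of that hint (not part of the original answer), you can overlay a normal density on a histogram of the Y computed above:
# Y holds 1000 sums of 500 uniform(0, 100) draws (from the code above)
hist(Y, freq = FALSE, breaks = 30, main = "Distribution of Y")
curve(dnorm(x, mean = mean(Y), sd = sd(Y)), add = TRUE, col = "red", lwd = 2)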

How to Standardize a Column of Data in R and Get a Bell Curve Histogram to find the percentage that falls within a range?

I have a data set and one of the columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column is between 320 and 350 using R. To my understanding, I need to standardize this data and create a bell curve first. I have the mean and standard deviation, but when I do (X - mean)/SD and get a histogram of this column, it's still not a bell curve.
This is the code I tried.
myData$C1 <- (myData$C1 - C1_mean) / C1_SD
If you are simply counting the number of observations in that range, there's no need to do any standardization and you may directly use
mean(myData$C1 >= 320 & myData$C1 <= 350)
As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape itself of the density function remains the same.
For instance,
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
I suspect that what you are looking for is an estimate of the true, unobserved proportion. The standardization procedure would then be needed if you had to look up tabulated values of the standard normal distribution function. In R, however, we can do this directly. In particular,
pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931
That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from that above since we misspecified the underlying distribution by assuming it to be normal; it actually is a mixture of two normal distributions.
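For comparison, here is a sketch (not part of the original answer) of the exact probability under the true equal-weight mixture of N(300, 20) and N(400, 20) that generated x; it is much closer to the empirical proportion:
# P(320 <= X <= 350) under the 50/50 mixture used above
0.5 * (pnorm(350, 300, 20) - pnorm(320, 300, 20)) +
  0.5 * (pnorm(350, 400, 20) - pnorm(320, 400, 20))
# roughly 0.079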

Generate uniform random variable when lower boundary is close to zero

When I run runif(100, max=0.1, min=1e-10) in R
I get 100 uniformly distributed random variables between 0.0001 and 0.1. So there are no random values between the min value (min=1e-10) and 0.0001. How can I generate uniform random variables over the whole interval (between the min and max values)?
Maybe you aren't generating enough values to make it likely that you've seen one:
> range(runif(100,max=0.1,min=exp(-10)))
[1] 0.00199544 0.09938462
> range(runif(1000,max=0.1,min=exp(-10)))
[1] 0.0002407759 0.0999674631
> range(runif(10000,max=0.1,min=exp(-10)))
[1] 5.428209e-05 9.998912e-02
How often do they occur?
> sum(runif(10000,max=0.1,min=exp(-10)) < .0001)
[1] 5
5 in that sample of 10000. So the chances of getting one in a sample of 100 is... (Actually you can work this out exactly from the number and the properties of a Uniform distribution).
(Note: the examples above use exp(-10) rather than the question's 1e-10; the reasoning is the same.)
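To complete that calculation (a sketch, not in the original answer), using min = 1e-10 as in the question:
p <- (1e-4 - 1e-10) / (0.1 - 1e-10) # P(a single draw falls below 1e-4)
1 - (1 - p)^100 # P(at least one such draw among 100): roughly 0.095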
Given your max of 0.1 and min of 1e-10, the probability that any given value is less than 1e-4 is given by
(1e-4 - 1e-10) / (0.1 - 1e-10) = 9.99999e-04
The probability that 100 random values from this distribution are all greater than 1e-4 is
(1 - 9.99999e-04) ^ 100 = 0.90479
About 90.5%. So you shouldn't be at all surprised that in a draw of 100 numbers from this distribution you didn't see any less than 1e-4; theoretically this happens about 90.5% of the time. We can even verify this in simulation:
set.seed(47) # for replicability
# 100,000 times, draw 100 numbers from your uniform distribution
d = replicate(n = 1e5, runif(100, max = 0.1, min = 1e-10))
# what proportion of the 100k draws have no values less than 1e-4?
mean(colSums(d < 1e-4) == 0)
# [1] 0.90557
# 90.56% - very close to our calculated 90.48%
For more precision, we can repeat with even more replications
# same thing, 1 million replications
d2 = replicate(n = 1e6, runif(100, max = 0.1, min = 1e-10))
mean(colSums(d2 < 1e-4) == 0)
# [1] 0.90481
So, with 1 million replications, runif() almost exactly meets expectations: it is off from the expectation by 0.90481 - 0.90479 = 0.00002. I would say there is absolutely no evidence that runif is broken.
We can even plot the histograms for some of the replications. Here are the first 20:
par(mfrow = c(4, 5), mar = rep(0.4, 4))
for (i in 1:20) {
  hist(d[, i], main = "", xlab = "", axes = FALSE,
       col = "gray70", border = "gray40")
}
The histograms are showing 10 bars each, so each bar is about .01 wide (since the total range is about 0.1). The range you are interested in is about 0.0001 wide. To see that in a histogram, we would need to plot 1,000 bars per plot, 100 times as many bars. Using 1,000 bins doesn't make a lot of sense when there are only 100 values. Of course almost all the bins will be empty, and the lowest one, in particular, will be empty about 90% of the time as we calculated above.
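To see how sparse such a fine-grained histogram would be, here is a sketch (not part of the original answer) using one column of d from the replicate() call above:
hist(d[, 1], breaks = seq(0, 0.1, length.out = 1001),
     main = "100 values in 1000 bins", xlab = "")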
To get more very low random values, your two choices are (a) draw more numbers from the uniform distribution or (b) change to a distribution that puts more weight close to 0; you could try an exponential distribution, or, if you want a hard upper bound as well, a scaled beta distribution. Your other choice is to not use random values at all: maybe you want evenly spaced values, and seq is what you're looking for.
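Hedged sketches of those alternatives (the parameters are purely illustrative, not from the original answer):
x1 <- runif(1e5, min = 1e-10, max = 0.1) # (a) simply draw many more uniform values
x2 <- rexp(100, rate = 100) # (b) exponential: more weight near 0, but no hard upper bound
x3 <- 0.1 * rbeta(100, shape1 = 0.5, shape2 = 1) # (b) beta rescaled to (0, 0.1)
x4 <- seq(1e-10, 0.1, length.out = 100) # (c) evenly spaced, non-random values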

Generating normal distribution data within range 0 and 1

I am working on a project about income distribution... I would like to generate random data for testing the theory. Let's say I have N=5 countries, each with a population of n=1000, and I want to generate a random income (NORMALLY DISTRIBUTED) for each person in each population, with the constraint that income lies between 0 and 1, with the same mean but a DIFFERENT standard deviation for each country. I used the function rnorm(n, meanx, sd) to do this. I know that the UNIFORM DISTRIBUTION (runif(n, min, max)) has arguments for setting min and max, but rnorm does not. Since rnorm doesn't provide arguments for setting the min and max values, I have to write a piece of code that checks the generated data to see whether they satisfy my constraint of [0,1] or not.
I successfully generated income data for n=100. However, if I increase n to a multiple of 100, e.g. n=200, 300, ..., 1000, my program hangs. I can see why it hangs: it just generates data randomly without min/max constraints, so for larger n the probability of generating a whole sample inside the range is much lower than for n=100, and the loop just keeps running: generate data, fail the check, repeat.
To work around this, I am thinking of breaking n=1000 into small batches, say b=100. Since rnorm successfully generates 100 samples in the range [0,1], it should work if I run the loop 10 times, once for each batch of 100 samples, and then collect all 10 * 100 samples into one data set of 1000 for my later analysis.
However, mathematically speaking, I am NOT SURE whether the NORMAL DISTRIBUTION constraint is still satisfied for n=1000 when doing it this way. I have attached my code below. Hopefully my explanation is clear; all of your opinions will be very useful to my work. Thanks a lot.
# Update:
# plot histogram
# create the random data with same mean, different standard deviation and x in range [0,1]
# Generate the output file
# Generate data for K countries
#---------------------------------------------
# Configurable variables
number_of_populations = 5
n=100 #number of residents (*** input the number whish is k times of 100)
meanx = 0.7
sd_constant = 0.1 # sd = sd_constant + j/50
min=0 #min income
max=1 #max income
#---------------------------------------------
batch =100 # divide the large number of residents into small batch of 100
x <- matrix(0,                                # the data elements
            nrow = n,                         # number of rows
            ncol = number_of_populations,     # number of columns
            byrow = TRUE)                     # fill matrix by rows
x_temp = rep(0,n)
# generate income data randomly for each country
for (j in 1:number_of_populations){
  # 1. Generate uniform distribution
  #x[,j] <- runif(n, min, max)
  # 2. Generate normal distribution
  sd = sd_constant + j/50
  repeat {
    x_temp <- rnorm(n, meanx, sd)
    is_inside = TRUE
    for (i in 1:n){
      if (x_temp[i] < min || x_temp[i] > max) {
        is_inside = FALSE
        break
      }
    }
    if (is_inside == TRUE) {break}
  } #end repeat
  x[,j] <- x_temp
}
# write in csv
# each column stores different income of its residents
working_dir= "D:\\dataset\\"
setwd(working_dir)
file_output = "random_income.csv"
sink(file_output)
write.table(x,file=file_output,sep=",", col.names = F, row.names = F)
sink()
file.show(file_output) #show the file in directory
#plot histogram of x for each population
#par(mfrow=c(3,3), oma=c(0,0,0,0,0))
attach(mtcars)
par(mfrow=c(1,5))
for (j in 1:number_of_populations) {
  #plot(X[,i],y,'xlab'=i)
  hist(x[,j], main = "Normal", xlab = j)
}
Here's a sensible simple way...
sampnorm01 <- function(n) qnorm(runif(n,min=pnorm(0),max=pnorm(1)))
Test it out:
mysamp <- sampnorm01(1e5)
hist(mysamp)
Thanks to @PatrickPerry, here is a generalized truncated normal, again using the inverse CDF method. It allows for different parameters on the normal and different truncation bounds.
rtnorm <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
  bounds <- pnorm(c(min, max), mean, sd)
  u <- runif(n, bounds[1], bounds[2])
  qnorm(u, mean, sd)
}
Test it out:
mysamp <- rtnorm(1e5, .7, .2)
hist(mysamp)
You can normalize the data:
x = rnorm(100)
# normalize
min.x = min(x)
max.x = max(x)
x.norm = (x - min.x)/(max.x - min.x)
print(x.norm)
Here is my take on it.
The data is first normalized (at which stage the standard deviation is lost). After that, it is fitted to the range specified by the lower and upper parameters.
#' Creates a random normal distribution within the specified bounds
#'
#' WARNING: This function does not preserve the standard deviation
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean = 0, sd = 1, lower = -1, upper = 1){
  mean = ifelse(test = (is.na(mean) || (mean < lower) || (mean > upper)),
                yes = mean(c(lower, upper)),
                no = mean)
  data <- rnorm(n, mean = mean, sd = sd)                  # data
  if (!is.na(lower) && !is.na(upper)){                    # adjust data to specified range
    drange <- range(data)                                 # data range
    irange <- range(lower, upper)                         # input range
    data <- (data - drange[1]) / (drange[2] - drange[1])  # normalize data (make it 0 to 1)
    data <- (data * (irange[2] - irange[1])) + irange[1]  # adjust to specified range
  }
  return(data)
}
Example:
a <- rtnorm(n = 1000, lower = 10, upper = 90)
range(a)
plot(hist(a, 50))

Efficiently generating discrete random numbers

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector (an increasing vector starting at 0 and ending at 1), cdf
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
...
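A literal, unvectorized translation of those steps might look like this (a sketch with a toy CDF; the example below shows the real use case):
draw_one <- function(u, cdf) which(u < cdf)[1] # first index k with u < cdf[k]
cdf_small <- cumsum(c(0.2, 0.3, 0.5)) # toy CDF: 0.2, 0.5, 1.0
draw_one(0.42, cdf_small) # returns 2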
Example
First generate a cdf:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))
If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75
If you have a finite number of possible values then you can use findInterval or cut or better sample as mentioned by @Hemmo.
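For completeness, a hedged findInterval sketch (not part of the original answer) that performs the same lookup in a vectorized way:
cdf <- cumsum(runif(10000, 0, 0.1))
cdf <- cdf / max(cdf)
u <- runif(1e6)
vals <- findInterval(u, cdf) + 1 # +1 so that u < cdf[1] maps to value 1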
However, if you want to generate data from a distribution that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.), then here is an algorithm that will work (it also works with a finite number of values if wanted):
Start with your vector of uniform values and loop through the distribution's probability values, subtracting them from the vector of uniforms; the random value is the iteration at which the running value goes negative. This is easier to see with an example. The following generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares the result to using the inverse CDF (which is more efficient in this case, where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp) )
while (any(tmp > 0)) {
  tmp <- tmp - dpois(i, 5)
  randvals <- randvals + (tmp > 0)
  i <- i + 1
}
randvals2 <- qpois( tmp2, 5 )
all.equal(randvals, randvals2)
