Unable to find outside of range value using R tool - r

1. Generate 500 random numbers between 0 and 100.
2. Find the sum of these 500 random numbers.
3. Repeat steps 1 and 2 above 1000 times, generating a new set of random numbers each time.
4. Let Y denote the sum of the 500 numbers; obtain a box-and-whisker plot of the random variable Y.
5. Display the values of Y that lie outside mean +/- 2*SD, where SD is the standard deviation.
6. Which statistical distribution is justified for the random variable Y?
My attempt so far:
y <- runif(500, min = 1, max = 100) # 1
sum(y) # 2
c <- runif(1000, min = 1, max = 100) # 3
sum(c) # 4
This is what I have managed to work out so far, but I am not sure whether it is correct.
Please help me out.

This seems to be a homework task, but let me try to point you in the right direction.
Steps 1 to 3 create the sums of random variables. Since no distribution is given, we assume a uniform distribution.
Y <- numeric(0) # sums are stored here
for (i in 1:1000) {
  Y[i] <- sum(runif(500, min=0, max=100))
}
So Y contains 1000 sums of 500 uniformly distributed random variables.
There is another way to create this Y:
Y <- sapply(1:1000, function(x) sum(runif(500, min=0, max=100)))
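For steps 4 to 6, I suggest you take a look at the R help for box plots (steps 4 and 5) and histograms (step 6). Try ?boxplot and ?hist.
For steps 4 and 5, something like this marks the values outside mean +/- 2*SD on the box plot: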
Y <- replicate(1000, sum(runif(500, min=0, max=100)))
min_val = mean(Y) - 2*sd(Y)
max_val = mean(Y) + 2*sd(Y)
Y_min <- Y[Y < min_val]
Y_max <- Y[Y > max_val]
boxplot(Y, range=1)
points(rep(1,length(Y_min)), Y_min, pch=23, col="red")
points(rep(1,length(Y_max)), Y_max, pch=23, col="blue")
You get an answer for step 6 if you understand the mathematics. The central limit theorem may give you some insight.
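As a rough sketch of how steps 5 and 6 could be checked numerically: the code below lists the values of Y outside mean +/- 2*SD and compares the sample moments with those of a sum of 500 independent uniform(0, 100) draws (mean 500 * 50, variance 500 * 100^2 / 12). The seed and the variable names theo_mean, theo_sd and prop_outside are my own choices, not part of the task.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# 1000 sums of 500 uniform(0, 100) draws, as in the answer above
Y <- replicate(1000, sum(runif(500, min = 0, max = 100)))

# Step 5: values of Y outside mean +/- 2*SD
outside <- Y[abs(Y - mean(Y)) > 2 * sd(Y)]
prop_outside <- length(outside) / length(Y)

# Step 6: moments of a sum of 500 independent uniform(0, 100) draws
theo_mean <- 500 * (0 + 100) / 2           # 25000
theo_sd   <- sqrt(500 * (100 - 0)^2 / 12)  # about 645.5

print(outside)
cat("sample mean / sd:", mean(Y), sd(Y), "\n")
cat("theoretical mean / sd:", theo_mean, theo_sd, "\n")
cat("proportion outside 2 SD:", prop_outside, "\n")

# Histogram of Y with the matching normal density overlaid
hist(Y, breaks = 30, freq = FALSE)
curve(dnorm(x, theo_mean, theo_sd), add = TRUE, col = "red")
```

If Y is close to normal, roughly 5 % of the values should fall outside the 2 SD band, and the histogram should follow the red curve.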

Related

Generating bivariate data where x variable is uniformly distributed between 0 and 1 and Y is normally distributed with mean 1/x with some noise

I used x <- c(runif(100, 0, 1)) to generate 100 x's between 0 and 1.
Now for each of the x's I am trying to generate 10 y's with mean 1/x and variance of 1.
Preferably stored in a matrix and so if I was to plot the 1000 points on y and x, it would look like the graph y = 1/x + some error.
Any help would be greatly appreciated.
If you want the data in a matrix, then you can do
x <- runif(100, 0, 1)
y <- sapply(x, function(m) rnorm(10, 1/m, 1))
This uses sapply to generate 10 normal values for each x value.
If you want a single two-column matrix instead, then maybe
points <- do.call("rbind", lapply(x, function(m) cbind(x=m, y=rnorm(10, 1/m, 1))))
is what you want. You can plot that with
plot(y~x, points)
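A small sketch of how the first (matrix) version could be plotted as well: sapply returns a 10 x 100 matrix here, one column per x value, so each x has to be repeated 10 times to line up with the flattened matrix. The seed and the names x_long and y_long are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- runif(100, 0, 1)
y <- sapply(x, function(m) rnorm(10, 1/m, 1))  # 10 rows, 100 columns
dim(y)

# as.vector() flattens the matrix column by column,
# so each x is repeated once per row of y
x_long <- rep(x, each = nrow(y))
y_long <- as.vector(y)

plot(x_long, y_long, xlab = "x", ylab = "y")
curve(1/x, add = TRUE, col = "red")  # the underlying mean curve y = 1/x
```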

How to generate normally distributed random numbers in specific interval?

I want to generate 100 normally distributed random numbers in the interval [-50,50]. However, with the code below, the range of the generated numbers is not [-50,50].
n <- rnorm(100, -50,50)
plot(n)
Your question is strangely worded, because it seems you don't fully understand the rnorm function.
rnorm(100, -50,50)
generates a sample of 100 points drawn from a normal distribution centered on -50 with a standard deviation of 50. So you need to specify what you mean by "100 normally distributed random numbers in the interval [-50,50]". A normal distribution has no upper or lower limit: the probability density is never 0, it is just very low several standard deviations away from the mean. So:
Either you want a normal distribution centered on 0 with a standard deviation of 50, and the answer is rnorm(100, 0, 50), but you will get values above 50 and below -50.
Or you actually want a normal distribution with no value outside the [-50,50] range. In this case you still need to choose a standard deviation, and you will need to discard the values drawn outside the range. You could do something like:
sd <- 50
n <- data.frame(draw = rnorm(1000, 0,sd))
final <- sample(n$draw[!with(n, draw > 50 | draw < -50)],100)
Here is an example of what it does for 2 different sd:
sd <- 10
n1 <- data.frame(draw = rnorm(1000, 0,sd))
final1 <- sample(n1$draw[!with(n1, draw > 50 | draw < -50)],100)
sd <- 50
n2 <- data.frame(draw = rnorm(1000, 0,sd))
final2 <- sample(n2$draw[!with(n2, draw > 50 | draw < -50)],100)
par(mfrow = c(1,2))
hist(final1,main = "sd = 10")
hist(final2,main = "sd = 50")
Or you just want to sample values in this range with a flat distribution. In this case, just use sample(-50:50, 100, replace = TRUE) for integers, or runif(100, -50, 50) for continuous values.
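For the truncated case above, a fixed pool of 1000 draws can leave fewer than 100 values in range when the standard deviation is large. A minimal sketch of one alternative, plain rejection sampling that keeps drawing until enough values are inside the bounds, could look like this. The function name rnorm_in_range and its arguments are made up for illustration.

```r
# Hypothetical helper: draw n normal values and reject those outside [lower, upper]
rnorm_in_range <- function(n, mean = 0, sd = 1, lower = -50, upper = 50) {
  stopifnot(lower < upper, sd > 0)
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(n, mean, sd)
    out <- c(out, draw[draw >= lower & draw <= upper])  # keep in-range draws only
  }
  out[seq_len(n)]  # trim any surplus
}

set.seed(1)  # arbitrary seed, only for reproducibility
final <- rnorm_in_range(100, mean = 0, sd = 50)
range(final)
hist(final, main = "sd = 50, truncated to [-50, 50]")
```

The loop can run for a long time if the interval covers only a tiny part of the distribution, so this is only reasonable when a fair share of the draws lands in range.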
You have to make a sacrifice. Either your random variable is not normally distributed because the tails are cut off, or you compromise on the boundaries. You can define your random variable to "practically" lie in a range, that is, you accept that a very small percentage lies outside. Maybe 1 % would be an acceptable choice for your purpose.
my_range <- setNames(c(-50, 50), c("lower", "upper"))
prob <- 0.01 # probability to lie outside of my_range
# you have to define this, 1 % in this case
my <- mean(my_range)
z_value <- qnorm(prob/2)
sigma <- (my - my_range["lower"]) / (-1 * z_value)
# proof
N <- 100000 # large number
sim_vec <- rnorm(N, my, sigma)
chk <- 1 - length(sim_vec[sim_vec >= my_range["lower"] &
                          sim_vec <= my_range["upper"]]) / length(sim_vec)
cat("simulated proportion outside range:", chk, "\n")
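The simulated proportion can also be cross-checked without simulation, since pnorm gives the tail probabilities directly. This is only a sketch with my own variable names (lower, upper, mu, p_outside), using the same 1 % choice as above.

```r
lower <- -50
upper <- 50
prob  <- 0.01                              # accepted probability outside the range
mu    <- (lower + upper) / 2
sigma <- (mu - lower) / -qnorm(prob / 2)   # same formula as above

# Probability mass below the lower bound plus mass above the upper bound
p_outside <- pnorm(lower, mu, sigma) +
  pnorm(upper, mu, sigma, lower.tail = FALSE)

cat("sigma:", sigma, " exact proportion outside range:", p_outside, "\n")
```

By construction p_outside equals prob, and sigma comes out at roughly 19.4 for these bounds.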

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I have already come across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but they only give me very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like the data to be spread much more widely, so that I could simply transform it by adding 50 and get a range quite similar to that of my first variable.
Does anyone know a way to generate this kind of more or less meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
On its own this results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside your range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
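The noise level does not have to come from a search. For B = 1e5 - slope * A + noise with independent noise, the population correlation is -slope * sd(A) / sqrt(slope^2 * sd(A)^2 + noise_sd^2), which can be solved for noise_sd. The sketch below assumes the same setup as above; the names slope, sd_A and noise_sd are mine.

```r
rho   <- -0.78   # desired correlation
sd_A  <- 10      # standard deviation used to generate A
slope <- 1e3     # factor applied to A above

# Solve rho^2 = (slope * sd_A)^2 / ((slope * sd_A)^2 + noise_sd^2) for noise_sd
noise_sd <- slope * sd_A * sqrt(1 / rho^2 - 1)
noise_sd  # roughly 8.0e3, close to the 8.15e3 found by search

set.seed(1)
A <- rnorm(10000, 50, sd_A)
B <- 1e5 - slope * A + rnorm(10000, 0, noise_sd)
cor(A, B)  # close to rho, up to sampling noise
```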
If you're happy with the correlation and the marginal distribution (i.e. shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
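To illustrate why a linear transformation is safe here: a linear map with positive slope leaves the Pearson correlation unchanged. The sketch below uses a hypothetical stand-in for the small-valued var2 from the question and a made-up helper rescale_to, which maps the smallest value to lo and the largest to hi (a slightly different linear map from multiplying by 100,000 and adding 50,000).

```r
set.seed(1)  # arbitrary seed, only for reproducibility
var1 <- rnorm(100, 50, 10)

# Hypothetical stand-in for the small-valued var2 from the question,
# built only to be negatively correlated with var1
var2_small <- -0.78 * as.vector(scale(var1)) + sqrt(1 - 0.78^2) * rnorm(100)

# Made-up helper: linear map of v onto the interval [lo, hi]
rescale_to <- function(v, lo, hi) {
  lo + (v - min(v)) / (max(v) - min(v)) * (hi - lo)
}

var2 <- rescale_to(var2_small, 0, 1e5)

cor(var1, var2_small)
cor(var1, var2)   # same value as the line above
range(var2)
```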

Generating normal distribution data within range 0 and 1

I am working on a project about income distribution, and I would like to generate random data to test the theory. Let's say I have N=5 countries, each with a population of n=1000, and I want to generate a random income (NORMAL DISTRIBUTION) for each person in each population, with the constraints that income is between 0 and 1 and that all countries have the same mean but DIFFERENT standard deviations. I used the function rnorm(n, meanx, sd) to do it. I know that the UNIFORM DISTRIBUTION function runif(n, min, max) has arguments for setting min and max, but rnorm does not. Since rnorm doesn't provide arguments for setting the min and max values, I have to write a piece of code that checks whether the set of random data satisfies my constraint of [0,1] or not.
I successfully generated income data for n=100. However, if I increase n to a multiple of 100, e.g. n=200, 300, ..., 1000, my program hangs. I can see why: it just generates data randomly without min and max constraints, so with larger n the probability that a whole sample passes the check is lower than with n=100. The loop just keeps running: generate data, fail the check.
To fix this problem, I am thinking of breaking n=1000 into small batches, say b=100. Since rnorm can successfully generate 100 samples in the range [0,1], and it is a NORMAL DISTRIBUTION, it should work well if I run the loop 10 times, once for each batch of 100 samples. I will then collect all 10 * 100 samples into one data set of 1000 for my later analysis.
However, mathematically speaking, I am NOT SURE whether the constraint of a NORMAL DISTRIBUTION for n=1000 is still satisfied when doing it this way. I have attached my code here. Hopefully my explanation is clear. All of your opinions will be very useful to my work. Thanks a lot.
# Update:
# plot histogram
# create the random data with same mean, different standard deviation and x in range [0,1]
# Generate the output file
# Generate data for K countries
#---------------------------------------------
# Configurable variables
number_of_populations = 5
n=100 #number of residents (*** input a number which is k times 100)
meanx = 0.7
sd_constant = 0.1 # sd = sd_constant + j/50
min=0 #min income
max=1 #max income
#---------------------------------------------
batch =100 # divide the large number of residents into small batch of 100
x= matrix(
  0, # the data elements
  nrow=n, # number of rows
  ncol=number_of_populations, # number of columns
  byrow = TRUE) # fill matrix by rows
x_temp = rep(0,n)
# generate income data randomly for each country
for (j in 1:number_of_populations){
  # 1. Generate uniform distribution
  #x[,j] <- runif(n,min, max)
  # 2. Generate Normal distribution
  sd = sd_constant+j/50
  repeat
  {
    {
      x_temp <- rnorm(n, meanx, sd)
      is_inside = TRUE
      for (i in 1:n){
        if (x_temp[i]<min || x_temp[i] >max) {
          is_inside = FALSE
          break
        }
      }
    }
    if(is_inside==TRUE) {break}
  } #end repeat
  x[,j] <- x_temp
}
# write in csv
# each column stores different income of its residents
working_dir= "D:\\dataset\\"
setwd(working_dir)
file_output = "random_income.csv"
sink(file_output)
write.table(x,file=file_output,sep=",", col.names = F, row.names = F)
sink()
file.show(file_output) #show the file in directory
#plot histogram of x for each population
#par(mfrow=c(3,3), oma=c(0,0,0,0,0))
attach(mtcars)
par(mfrow=c(1,5))
for (j in 1:number_of_populations)
{
  #plot(X[,i],y,'xlab'=i)
  hist(x[,j],main="Normal",'xlab'=j)
}
Here's a sensible simple way...
sampnorm01 <- function(n) qnorm(runif(n,min=pnorm(0),max=pnorm(1)))
Test it out:
mysamp <- sampnorm01(1e5)
hist(mysamp)
Thanks to @PatrickPerry, here is a generalized truncated normal, again using the inverse CDF method. It allows for different parameters on the normal and different truncation bounds.
rtnorm <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
  bounds <- pnorm(c(min, max), mean, sd)
  u <- runif(n, bounds[1], bounds[2])
  qnorm(u, mean, sd)
}
Test it out:
mysamp <- rtnorm(1e5, .7, .2)
hist(mysamp)
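Applied to the setup in the question, the inverse CDF sampler removes the need for the retry loop altogether. The following is only a sketch: it repeats the rtnorm definition from above so that it runs on its own, and reuses the question's rule sd = 0.1 + j/50 for the five countries. Note that truncation shifts the sample mean and standard deviation somewhat away from the parameters passed in, more so for the larger standard deviations.

```r
# Truncated normal via the inverse CDF, as defined in the answer above
rtnorm <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
  bounds <- pnorm(c(min, max), mean, sd)
  u <- runif(n, bounds[1], bounds[2])
  qnorm(u, mean, sd)
}

set.seed(1)  # arbitrary seed, only for reproducibility
number_of_populations <- 5
n     <- 1000
meanx <- 0.7
sds   <- 0.1 + (1:number_of_populations) / 50  # same rule as in the question

# One column per country; every value is inside [0, 1] by construction
x <- sapply(sds, function(s) rtnorm(n, mean = meanx, sd = s))

dim(x)
range(x)
round(colMeans(x), 3)       # sample means after truncation
round(apply(x, 2, sd), 3)   # sample standard deviations after truncation
```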
You can normalize the data:
x = rnorm(100)
# normalize
min.x = min(x)
max.x = max(x)
x.norm = (x - min.x)/(max.x - min.x)
print(x.norm)
Here is my take on it.
The data is first normalized (at which stage the standard deviation is lost). After that, it is fitted to the range specified by the lower and upper parameters.
#' Creates a random normal distribution within the specified bounds
#'
#' WARNING: This function does not preserve the standard deviation
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean = 0, sd = 1, lower = -1, upper = 1){
  mean = ifelse(test = (is.na(mean)|| (mean < lower) || (mean > upper)),
                yes = mean(c(lower, upper)),
                no = mean)
  data <- rnorm(n, mean = mean, sd = sd) # data
  if (!is.na(lower) && !is.na(upper)){ # adjust data to specified range
    drange <- range(data) # data range
    irange <- range(lower, upper) # input range
    data <- (data - drange[1]) / (drange[2] - drange[1]) # normalize data (make it 0 to 1)
    data <- (data * (irange[2] - irange[1])) + irange[1] # adjust to specified range
  }
  return(data)
}
Example:
a <- rtnorm(n = 1000, lower = 10, upper = 90)
range(a)
plot(hist(a, 50))

Generate numbers in R

In R, how can I generate N numbers that have a mean of X and a median of Y (or at least close to them)?
Or perhaps more generally, is there an algorithm for this?
There is an infinite number of solutions.
Approximate algorithm:
1. Generate n/2 numbers below the median
2. Generate n/2 numbers above the median
3. Add your desired median and check
4. Add one number with enough weight to satisfy your mean, which you can solve for
Example assuming you want a median of zero and a mean of twenty:
R> set.seed(42)
R> lo <- rnorm(10, -10); hi <- rnorm(10, 10)
R> median(c(lo,0,hi))
[1] 0 # this meets our first criterion
R> 22*20 - sum(c(lo,0,hi)) # (n+1)*desiredMean - currentSum
[1] 436.162 # so if we insert this, we get the right answer
R> mean(c(lo,0,hi,22*20 - sum(c(lo,0,hi))))
[1] 20 # so we meet criterion two
R>
because desiredMean times (n+1) has to be equal to sum(currentSet) + x, so we solve for x and get the expression above.
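Note that inserting a single correcting number makes the count even, so the median of the final set is only close to the target. A possible variant, sketched below, appends two values instead: a second copy of the median and one correcting value, so the median stays exact and the mean is hit up to floating point error. The helper gen_mean_median and its arguments are hypothetical names, not part of any package.

```r
# Hypothetical helper: 2 * n_half + 3 numbers with a given mean and median
gen_mean_median <- function(n_half, target_mean, target_median, spread = 10) {
  # n_half values below and n_half values above the target median
  lo <- target_median - abs(rnorm(n_half, mean = spread))
  hi <- target_median + abs(rnorm(n_half, mean = spread))
  x  <- c(lo, target_median, hi)

  # The two extra values must sum to `needed` for the mean to come out right
  needed <- (length(x) + 2) * target_mean - sum(x)

  # One extra value is a second copy of the median, the other carries the weight;
  # the middle element of the sorted result is then always a copy of the median
  c(x, target_median, needed - target_median)
}

set.seed(42)  # arbitrary seed, only for reproducibility
z <- gen_mean_median(10, target_mean = 20, target_median = 0)
mean(z)
median(z)
```

The price is the same as in the example above: one value ends up far away from the rest, so the result is anything but bell-shaped.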
For a set of data that looks fairly 'normal', you can use the correction factor method as outlined by @Dirk-Eddelbuettel, but with your custom values used to generate a set of data around your mean. Note that in this code X plays the role of the median and Y the role of the mean:
X = 25   # the set is mirrored around X below, so X ends up as the median
Y = 25.5 # the correction number below pulls the mean to Y
N = 100
set.sd = 5 # if you want to set the standard deviation of the set.
set <- rnorm(N, Y, set.sd) # generate a set around the mean
set.left <- set[set < X] # take only the left half
set <- c(set.left, X + (X - set.left)) # ... and make a copy on the right.
# redefine the set, adding in the correction number and an extra number on the opposite side to the correction:
set <- c(set,
         X + ((set.sd / 2) * sign(X - Y)),
         ((length(set)+ 2) * Y)
         - sum(set, X + ((set.sd / 2) * sign(X - Y)))
)
Take strong heed of the first answer's first sentence. Unless you know what underlying distribution you want, you can't do it. Once you know that distribution, there are R functions for many standard ones, such as runif, rnorm and rchisq. You can create an arbitrary distribution with the sample function.
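If you are okay with the restriction X > Y (mean greater than median), then you can fit a lognormal distribution. The lognormal conveniently has closed forms for both mean and median: the median is exp(meanlog) and the mean is exp(meanlog + sdlog^2/2), so meanlog = log(Y) and sdlog = sqrt(2*log(X/Y)).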
rmm <- function(n, X, Y) rlnorm(n, log(Y), sqrt(2*log(X/Y)))
E.g.:
z <- rmm(10000, 3, 1)
mean(z)
# [1] 2.866567
median(z)
# [1] 0.9963516
