Generating normally distributed data within the range 0 and 1 in R

I am working on a project about income distribution and would like to generate random data for testing the theory. Say I have N = 5 countries, each with a population of n = 1000, and I want to generate random incomes (normally distributed) for each person in each population, under the constraints that income lies between 0 and 1 and that all countries share the same mean but have different standard deviations. I used rnorm(n, meanx, sd) to do this. I know that the uniform distribution (runif(n, min, max)) has arguments for setting min and max, but rnorm does not. Since rnorm provides no arguments for a minimum and maximum, I had to write code that checks each set of random data against my [0,1] constraint.
I successfully generated income data for n = 100. However, if I increase n to a multiple of 100, e.g. n = 200, 300, ..., 1000, my program hangs. I can see why: it generates data without the min/max constraints, so the larger n is, the lower the probability that an entire sample falls inside [0,1], and the loop keeps cycling: generate data, fail the check, repeat.
Technically speaking, to work around this, I thought of breaking n = 1000 into small batches, say b = 100. Since rnorm succeeds for 100 samples in [0,1], I could run the loop ten times, once per batch of 100 samples, and then collect the 10 × 100 samples into one vector of 1000 for my later analysis.
However, mathematically speaking, I am NOT SURE whether the normal-distribution property for n = 1000 is still satisfied by doing it this way. I have attached my code below. Hopefully my explanation is clear to you. All of your opinions will be very useful to my work. Thanks a lot.
# Update:
# plot histogram
# create the random data with same mean, different standard deviation and x in range [0,1]
# Generate the output file
# Generate data for K countries
#---------------------------------------------
# Configurable variables
number_of_populations = 5
n = 100            # number of residents (input a multiple of 100)
meanx = 0.7
sd_constant = 0.1  # sd = sd_constant + j/50
min = 0            # minimum income
max = 1            # maximum income
#---------------------------------------------
batch = 100        # divide the large number of residents into small batches of 100
x = matrix(0,                             # the data elements
           nrow = n,                      # number of rows
           ncol = number_of_populations,  # number of columns
           byrow = TRUE)                  # fill matrix by rows
x_temp = rep(0, n)
# generate income data randomly for each country
for (j in 1:number_of_populations) {
  # 1. Generate uniform distribution
  # x[, j] <- runif(n, min, max)
  # 2. Generate normal distribution: redraw the whole sample until
  #    every value lies inside [min, max]
  sd = sd_constant + j/50
  repeat {
    x_temp <- rnorm(n, meanx, sd)
    is_inside = TRUE
    for (i in 1:n) {
      if (x_temp[i] < min || x_temp[i] > max) {
        is_inside = FALSE
        break
      }
    }
    if (is_inside) break
  } # end repeat
  x[, j] <- x_temp
}
# write to csv; each column stores the incomes of one country's residents
working_dir = "D:\\dataset\\"
setwd(working_dir)
file_output = "random_income.csv"
# write.table() writes straight to file_output; no sink() needed here
write.table(x, file = file_output, sep = ",", col.names = F, row.names = F)
file.show(file_output) # show the file in the directory
# plot histogram of x for each population
#par(mfrow=c(3,3), oma=c(0,0,0,0))
par(mfrow = c(1, 5))
for (j in 1:number_of_populations) {
  hist(x[, j], main = "Normal", xlab = j)
}

Here's a sensible simple way...
# inverse-CDF method: draw uniforms between pnorm(0) and pnorm(1),
# then map them back through qnorm
sampnorm01 <- function(n) qnorm(runif(n, min = pnorm(0), max = pnorm(1)))
Test it out:
mysamp <- sampnorm01(1e5)
hist(mysamp)
Thanks to @PatrickPerry, here is a generalized truncated normal, again using the inverse-CDF method. It allows for different parameters on the normal and different truncation bounds.
rtnorm <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
  bounds <- pnorm(c(min, max), mean, sd)  # CDF values at the truncation points
  u <- runif(n, bounds[1], bounds[2])     # uniforms restricted to that CDF range
  qnorm(u, mean, sd)                      # map back through the quantile function
}
Test it out:
mysamp <- rtnorm(1e5, .7, .2)
hist(mysamp)
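If you would rather stay close to the rejection idea in the question, a minimal sketch is to redraw only the out-of-range values instead of rejecting the whole vector; the expected number of retries per draw is then constant, so the loop no longer stalls as n grows. (The function name rtnorm_reject is just for illustration; it produces the same truncated normal as rtnorm above.)
rtnorm_reject <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
  x <- rnorm(n, mean, sd)
  bad <- which(x < min | x > max)
  while (length(bad) > 0) {
    x[bad] <- rnorm(length(bad), mean, sd)   # redraw only the offenders
    bad <- bad[x[bad] < min | x[bad] > max]  # keep indices still out of range
  }
  x
}
mysamp2 <- rtnorm_reject(1e5, .7, .2)
range(mysamp2)  # everything inside [0, 1]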

You can normalize the data:
x = rnorm(100)
# normalize
min.x = min(x)
max.x = max(x)
x.norm = (x - min.x)/(max.x - min.x)
print(x.norm)
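Note that this maps the sample minimum to exactly 0 and the maximum to exactly 1, so the original mean and standard deviation are not preserved. A small sketch of the same idea generalized to an arbitrary interval [a, b] (the bounds here are just an assumed example):
a <- 0.2; b <- 0.8             # assumed target bounds
x.ab <- a + (b - a) * x.norm   # linear map from [0, 1] to [a, b]
range(x.ab)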

Here is my take on it.
The data is first normalized (at which stage the standard deviation is lost). After that, it is fitted to the range specified by the lower and upper parameters.
#' Creates a random normal distribution within the specified bounds
#'
#' WARNING: This function does not preserve the standard deviation
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean = 0, sd = 1, lower = -1, upper = 1) {
  # fall back to the midpoint if the requested mean is missing or out of bounds
  mean = ifelse(test = (is.na(mean) || (mean < lower) || (mean > upper)),
                yes = mean(c(lower, upper)),
                no = mean)
  data <- rnorm(n, mean = mean, sd = sd)  # raw draws
  if (!is.na(lower) && !is.na(upper)) {   # adjust data to specified range
    drange <- range(data)                 # data range
    irange <- range(lower, upper)         # input range
    data <- (data - drange[1]) / (drange[2] - drange[1])  # normalize to [0, 1]
    data <- (data * (irange[2] - irange[1])) + irange[1]  # stretch to [lower, upper]
  }
  return(data)
}
Example:
a <- rtnorm(n = 1000, lower = 10, upper = 90)
range(a)
hist(a, 50)
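A quick check of the warning in the docstring: because of the min-max rescaling, the realized standard deviation is driven by the target range, not by the sd argument (a sketch, reusing a from the example above):
sd(a)    # roughly 12, not the sd = 1 default that was requested
mean(a)  # close to 50, the midpoint of [10, 90]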

Related

How to generate spatially correlated random fields of very high dimension with R

This is an extended question I found from here (Method #1: http://santiago.begueria.es/2010/10/generating-spatially-correlated-random-fields-with-r/) and here (Method #2: https://gist.github.com/brentp/1306786). I know these two sites cover the topic very well (thanks!) for relatively small dimensions (e.g., 1000x1). I am trying to generate spatially clustered binary data of much larger dimension, e.g. >= 100000x1, for example c(1,1,1,1,0,1,0,0,0,0, …, 0,0,0,0,0,0,0,0,0,0,0,0), 1000 times per case study. Here are slightly modified codes from those sites.
# Method #1
library(gstat)  # provides gstat() and vgm()
dim1 <- 1000
dim2 <- 1
xy <- expand.grid(seq_len(dim1), seq_len(dim2))
colnames(xy) <- c("x", "y")
geo.model <- gstat(formula = z ~ x + y, locations = ~x + y, dummy = TRUE, beta = 0,
                   model = vgm(psill = 1, "Exp", range = dim1),  # range parameter (!)
                   nmax = 30)  # spatial correlation model
sim.mat <- predict(geo.model, newdata = xy, nsim = 1)
sim.mat[, 3] <- ifelse(sim.mat[, 3] > quantile(sim.mat[, 3], .1), 0, 1)
plot(sim.mat[, 3])
# Method #2
# generate autocorrelated data.
nLags = 1000 # number of lags (size of region)
# fake, uncorrelated observations
X = rnorm(nLags)
# fake sigma... correlation decays with distance.
sigma = diag(nLags)
corr = .999
sigma <- corr ^ abs(row(sigma)-col(sigma))
#sigma
# Y is autocorrelated...
Y <- t(X %*% chol(sigma))
y <- ifelse(Y >= quantile(Y, probs=.9), 1, 0)[, 1]
plot(y)
Both methods work very well for generating binary data when dim1 is less than 10,000. However, with several hundred thousand (e.g., >= 100,000), they take a very long time or run into memory problems.
For example, when I used nLags = 50000 in Method #2, I got an error ("Error: cannot allocate vector of size 9.3 Gb") at the line sigma <- corr ^ abs(row(sigma)-col(sigma)).
I would like to find an efficient (time- and memory-saving) way to generate such spatially clustered binary data 1000 times (especially with dim1 >= 100000) for each case study (about 200 cases).
I have thought about supplying multiple probabilities to the sample function, or using a probability distribution, but I am not sure how; it is beyond my scope.
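One possible direction (a sketch, not from the original posts): the sigma in Method #2 is exactly the covariance matrix of an AR(1) process, so the same correlated series can be generated recursively in O(n) time and memory, without ever forming the nLags x nLags matrix. The function name rar1 is chosen here for illustration.
# O(n) replacement for the X %*% chol(sigma) step in Method #2:
# an AR(1) recursion has the same covariance structure corr^|i-j|
rar1 <- function(n, corr) {
  x <- numeric(n)
  x[1] <- rnorm(1)
  for (i in 2:n) {
    x[i] <- corr * x[i - 1] + sqrt(1 - corr^2) * rnorm(1)
  }
  x
}
nLags <- 100000                 # now feasible: no big matrix is allocated
Y <- rar1(nLags, 0.999)
y <- ifelse(Y >= quantile(Y, probs = 0.9), 1, 0)
plot(y, type = "h")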

How to generate normally distributed random numbers in specific interval?

I want to generate 100 normally distributed random numbers in the interval [-50,50]. However, in the code below the generated numbers are not confined to [-50,50].
n <- rnorm(100, -50,50)
plot(n)
Your question is strangely asked, because it seems you don't fully understand the rnorm function.
rnorm(100, -50, 50)
generates a sample of 100 points from a normal distribution centered on -50 with a standard deviation of 50. So you need to specify what you mean by
"100 normally distributed random numbers in interval [-50,50]". A normal distribution has no upper or lower limit: the probability of a draw is never 0, it is just very low several standard deviations away from the mean. So:
Either you want a normal distribution centered on 0 with standard deviation 50, in which case the answer is rnorm(100, 0, 50), but you will get values above 50 and below -50.
Or you actually want a normal distribution with no values outside the [-50,50] range, in which case you still need to choose a standard deviation, and you will need to cut off the values drawn outside the range. You could do something like:
sd <- 50
n <- data.frame(draw = rnorm(1000, 0,sd))
final <- sample(n$draw[!with(n, draw > 50 | draw < -50)],100)
Here is an example of what it does for 2 different sd:
sd <- 10
n1 <- data.frame(draw = rnorm(1000, 0, sd))
final1 <- sample(n1$draw[!with(n1, draw > 50 | draw < -50)], 100)
sd <- 50
n2 <- data.frame(draw = rnorm(1000, 0, sd))
final2 <- sample(n2$draw[!with(n2, draw > 50 | draw < -50)], 100)
par(mfrow = c(1,2))
hist(final1,main = "sd = 10")
hist(final2,main = "sd = 50")
Or you just want to sample values in this range from a flat distribution. In this case, just sample(-50:50, 100, replace = TRUE); a continuous variant is sketched below.
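(Note that sample(-50:50, ...) only returns integers; for a continuous uniform draw over the same interval you would use runif.)
runif(100, min = -50, max = 50)  # continuous uniform on [-50, 50]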
You have to make a sacrifice. Either your random variable is not truly normally distributed because the tails are cut off, or you compromise on the boundaries. You can define your random variable to "practically" lie in a range, i.e. you accept that a very small percentage lies outside. Maybe 1 % would be an acceptable choice for your purpose.
my_range <- setNames(c(-50, 50), c("lower", "upper"))
prob <- 0.01 # probability to lie outside of my_range
# you have to define this, 1 % in this case
my <- mean(my_range)
z_value <- qnorm(prob/2)
sigma <- (my - my_range["lower"]) / (-1 * z_value)
# proof
N <- 100000 # large number
sim_vec <- rnorm(N, my, sigma)
chk <- 1 - length(sim_vec[sim_vec >= my_range["lower"] &
sim_vec <= my_range["upper"]]) / length(sim_vec)
cat("simulated proportion outside range:", chk, "\n")

Change certain values of a vector based on mean and standard deviation of its subsets

I am trying to inject anomalies into a dataset, essentially changing certain values based on a condition. The dataset consists of 10 subsets. The condition is that anomalies should be 2.8-3 times the standard deviation of their segment away from the mean of that subset. For that, I divide the dataset into 10 equal parts, calculate the mean and standard deviation of each subset, and change certain values by placing them about 3 subset standard deviations away from the subset mean. The code looks like the following:
set.seed(1)
x <- rnorm(sample(1:35000, 32000, replace=F), 0, 1) # create dataset (rnorm takes the length of a vector first argument, so this is 32000 standard-normal draws)
y <- cumsum(x) #cumulative sum of dataset
j = 1
for (i in c(1:10)) {
  seg = y[j:j+3000]              # name each subset seg
  m = mean(seg)                  # mean of subset
  print(m)
  s = sd(seg)                    # standard deviation of subset
  print(s)
  o_data = sample(j:j+3000, 10)  # draw random numbers from j to j + 3000
  print(o_data)
  y[o_data] = m + runif(10, min = 2.8, max = 3) * s  # values = mean + 2.8-3 * sd
  print(y[o_data])
  j = j + 3000                   # increment j
  print(j)
}
The error I get is that the standard deviation is NA, so I am not able to set the values.
What other approach could accomplish this task? Essentially, I have to inject anomalies that are 2.8-3 standard deviations away from the rolling mean.
You have a simple error in your code. When you wrote
seg = y[j:j+3000], I believe you meant seg = y[j:(j+3000)]. Because : binds more tightly than +, j:j+3000 evaluates to (j:j)+3000, i.e. the single element j+3000 (which is why sd() returns NA).
Similarly, o_data = sample(j:j+3000,10) should be o_data = sample(j:(j+3000),10). A corrected version of the loop is sketched below.
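For reference, a sketch of the loop with both fixes applied (everything else from the question left unchanged):
j <- 1
for (i in 1:10) {
  seg <- y[j:(j + 3000)]              # parenthesized: the intended subset
  m <- mean(seg)
  s <- sd(seg)
  o_data <- sample(j:(j + 3000), 10)  # random positions within that subset
  y[o_data] <- m + runif(10, min = 2.8, max = 3) * s
  j <- j + 3000
}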

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom = function(x, sd, L, R) {
  R. = (log(R) - log(x)) / sd
  L. = (log(L) - log(x)) / sd
  ll = pnorm(R.) - pnorm(L.)
  return(ll)
}
df = data.frame(Range = seq(100, 500), sd = rep(0.1, 401), L = 200, U = 400)
df = mutate(df, Likelihood = dcustom(Range, sd, L, U))
with(df, plot(Range, Likelihood, type = 'l'))
abline(v = 200)
abline(v = 400)
In this function, sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all three are given. The function yields a likelihood close to 1 if the model estimate x (a derived parameter) lies between L and R, a smooth decrease in likelihood (between 0 and 1) near the bounds (whose sharpness depends on sd), and 0 if x is well outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
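Your three steps are essentially right. Since neither the normalizing constant nor the inverse CDF has a closed form here, one way to code them (a numerical sketch; the grid limits, resolution, and the name rcustom are assumptions) is to tabulate the likelihood on a grid, turn its cumulative sum into a CDF, and invert that with approx:
# grid-based inverse transform sampling from dcustom
xs <- seq(1, 1000, length.out = 10000)  # assumed support grid
dens <- dcustom(xs, sd = 0.1, L = 200, R = 400)
cdf <- cumsum(dens) / sum(dens)         # steps 1 + 2: normalize into a CDF
rcustom <- function(n) {                # step 3: inverse transform sampling
  u <- runif(n)
  approx(x = cdf, y = xs, xout = u, ties = "ordered")$y
}
hist(rcustom(1e5), breaks = 100)
abline(v = c(200, 400))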
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution, given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample) {
  lik <- dcustom(x = x, sd = 0.1, L = 200, R = 400)
  # propose a value for x (you can adjust the step size with the sd)
  x.proposed <- x + rnorm(1, 0, sd = 20)
  lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
  r <- lik.proposed / lik # acceptance ratio
  # accept the new value with probability min(1, r)
  if (runif(1) < r) {
    x <- x.proposed
  }
  # record the current state every iteration, accepted or not;
  # recording only accepted moves would bias the sample
  posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302
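As a sanity check (a sketch using base R's integrate), you can normalize dcustom numerically and overlay the exact density on the density estimate plotted above:
const <- integrate(dcustom, lower = 100, upper = 500, sd = 0.1, L = 200, R = 400)$value
curve(dcustom(x, sd = 0.1, L = 200, R = 400) / const,
      from = 100, to = 500, add = TRUE, col = "red", lty = 2)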

Generating samples from a two-Gaussian mixture in r (code given in MATLAB)

I'm trying to create (in R) the equivalent of the following MATLAB function, which generates n samples from a mixture of N(m1,(s1)^2) and N(m2,(s2)^2) with a fraction alpha from the first Gaussian.
I have a start, but the results differ notably between MATLAB and R (the MATLAB results occasionally give values of ±8, but the R version never even gives a value of ±5). Please help me sort out what is wrong here. Thanks :-)
For Example:
Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.
MATLAB
function
function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha)
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);
implementation
P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])
resulting plot
resulting hist
R
yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))
resulting plot
resulting hist
As always, THANK YOU!
SOLUTION
gaussmix <- function(nsim, mean_1, mean_2, std_1, std_2, alpha) {
  U <- runif(nsim)
  I <- as.numeric(U < alpha)
  y <- I * rnorm(nsim, mean = mean_1, sd = std_1) +
       (1 - I) * rnorm(nsim, mean = mean_2, sd = std_2)
  return(y)
}
z1 <- gaussmix(1000,0,0,1,6,0.95)
z1_standardized <- (z1-mean(z1))/sqrt(var(z1))
z2 <- gaussmix(1000,0,3,1,1,0.80)
z2_standardized <- (z2-mean(z2))/sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3-mean(z3))/sqrt(var(z3))
par(mfrow=c(2,3))
hist(z1_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of 95% of N(0,1) and 5% of N(0,36)",
col="blue",xlab=" ")
hist(z2_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of 80% of N(0,1) and 10% of N(3,1)",
col="blue",xlab=" ")
hist(z3_standardized,xlim=c(-10,10),ylim=c(0,500),
main="Histogram of samples of LN(0,1)",col="blue",xlab=" ")
##
plot(z1_standardized,type='l',
main="1000 samples from a mixture N(0,1) and N(0,36)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z2_standardized,type='l',
main="1000 samples from a mixture N(0,1) and N(3,1)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z3_standardized,type='l',
main="1000 samples from LN(0,1)",
col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
There are two problems, I think ... (1) your R code is creating a mixture of normal distributions with standard deviations of 1 and 37. (2) By setting prob equal to alpha in your rbinom() call, you're getting a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of Gaussian with sd 1, rather than a Gaussian with sd 1 that is contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (which is about 36.6) basically reduces it to a standard Gaussian with a slight bump near the origin ...
(The other answers posted here do solve your problem perfectly well, but I thought you might be interested in a diagnosis ...)
A more compact (and perhaps more idiomatic) version of your MATLAB gaussmix function (I think runif(n) < alpha is slightly more efficient than rbinom(n, size = 1, prob = alpha)):
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
  I <- runif(n) < alpha
  rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and diverse covariance structures. See ?mclust::sim. The example task would be done this way:
require(mclust)
simdata = sim(modelName = "V",
              parameters = list(pro = c(0.95, 0.05),
                                mean = c(0, 0),
                                variance = list(modelName = "V",
                                                d = 1,
                                                G = 2,
                                                sigmasq = c(1, 36))),  # variances 1 and 36
              n = 1000)
plot(scale(simdata[, 2]), type = "h")
I recently wrote density and sampling functions for a mixture of normal distributions:
dmultiNorm <- function(x, means, sds, weights)
{
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  N <- length(x)
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  weights <- weights / sum(weights)
  dens <- numeric(N)
  for (i in 1:n)
  {
    dens <- dens + weights[i] * dnorm(x, means[i], sds[i])
  }
  return(dens)
}
rmultiNorm <- function(N, means, sds, weights, scale = TRUE)
{
  if (length(means) != length(sds)) stop("Length of means must be equal to length of standard deviations")
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1, n)
  }
  if (length(weights) != n) stop("Length of weights not equal to length of means and sds")
  Res <- numeric(N)
  for (i in 1:N)
  {
    s <- sample(1:n, 1, prob = weights)
    Res[i] <- rnorm(1, means[s], sds[s])
  }
  return(Res)
}
Here means is a vector of means, sds a vector of standard deviations, and weights a vector of probabilities (proportional weights) for sampling from each of the component distributions. Is this useful to you? A usage example for the question's case is sketched below.
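For the example in the question (95% from N(0,1) and 5% from N(0,36), i.e. sds of 1 and 6), a usage sketch:
set.seed(1)
z <- rmultiNorm(1000, means = c(0, 0), sds = c(1, 6), weights = c(0.95, 0.05))
hist((z - mean(z)) / sd(z), breaks = 50)  # standardized, as in the question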
Here is code to do this task:
"For Example: Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one."
plot(multG <- c(rnorm(950), rnorm(50, 0, 6))[sample(1000)], type = "h") # N(0,36) has variance 36, i.e. sd 6
scmulG <- scale(multG)
summary(scmulG)
#-----------
V1
Min. :-9.01845
1st Qu.:-0.06544
Median : 0.03841
Mean : 0.00000
3rd Qu.: 0.13940
Max. :12.33107