Finding the mean of the standard normal distribution in a given interval in R

I want to find the mean of the standard normal distribution within a given interval.
For example, if I divide the standard normal distribution into two halves ([-Inf, 0] and [0, Inf]), I want to get the mean of each half.
The following code does almost exactly what I want:
divide <- 2
boundaries <- qnorm(seq(0,1,length.out=divide+1))
t <- sort(rnorm(100000))
means.1 <- rep(NA,divide)
for (i in 1:divide) {
means.1[i] <- mean(t[(t>boundaries[i])&(t<boundaries[i+1])])
}
But I need a more precise (and more elegant) method to calculate these numbers (means.1).
I tried the following code, but it did not work (probably because of my lack of probability knowledge).
divide <- 2
boundaries <- qnorm(seq(0,1,length.out=divide+1))
means.2 <- rep(NA,divide)
f <- function(x) {x*dnorm(x)}
for (i in 1:divide) {
means.2[i] <- integrate(f,lower=boundaries[i],upper=boundaries[i+1])$value
}
Any ideas?
Thanks in advance.

The problem is that the integral of dnorm(x) over the interval (-Inf, 0) isn't 1; that's why you got the wrong answer. To correct it, you must divide your result by 0.5 (the value of that integral). Like this:
func <- function(x, ...) x * dnorm(x, ...)
integrate(func, -Inf, 0, mean=0, sd=1)$value / (pnorm(0, mean=0, sd=1) - pnorm(-Inf, mean=0, sd=1))
Adapting it to different intervals should be easy.
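For the standard normal there is also a closed form you can use to cross-check the ratio above: E[X | a < X < b] = (dnorm(a) - dnorm(b)) / (pnorm(b) - pnorm(a)). A minimal sketch (the helper name trunc_mean is mine):
trunc_mean <- function(a, b) (dnorm(a) - dnorm(b)) / (pnorm(b) - pnorm(a))
trunc_mean(-Inf, 0) # -0.7978846, same value as the integrate() ratio above
trunc_mean(0, Inf)  #  0.7978846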

Thanks for answering my question.
I combined all the answers as I understand them:
divide <- 5
boundaries <- qnorm(seq(0,1,length.out=divide+1))
# My original thinking
t <- sort(rnorm(1e6))
means.1 <- rep(NA,divide)
for (i in 1:divide) {
means.1[i] <- mean(t[((t>boundaries[i])&(t<boundaries[i+1]))])
}
# Based on #DWin
t <- sort(rnorm(1e6))
means.2 <- tapply(t, findInterval(t, boundaries), mean)
# Based on #Rcoster
means.3 <- rep(NA,divide)
f <- function(x, ...) x * dnorm(x, ...)
for (i in 1:divide) {
means.3[i] <- integrate(f, boundaries[i], boundaries[i+1])$value / (pnorm(boundaries[i+1]) - pnorm(boundaries[i]))
}
# Based on #Kith
library(MASS) # fitdistr comes from MASS
t <- sort(rnorm(1e6))
means.4 <- rep(NA,divide)
for (i in 1:divide) {
means.4[i] <- fitdistr(t[t > boundaries[i] & t < boundaries[i+1]], densfun="normal")$estimate[1]
}
Results
> means.1
[1] -1.4004895486 -0.5323784986 -0.0002590746 0.5313539906 1.3978177100
> means.2
[1] -1.3993590768 -0.5329465789 -0.0002875593 0.5321381745 1.3990997391
> means.3
[1] -1.399810e+00 -5.319031e-01 1.389222e-16 5.319031e-01 1.399810e+00
> means.4
[1] -1.399057073 -0.531946615 -0.000250952 0.531615180 1.400086731
I believe #Rcoster's is the one I wanted. The rest are innovative approaches compared to mine, but still approximate.
Thanks.

You can use a combination of fitdistr and vector indexing.
Here's an example of how to get the mean and standard deviation of just the positive values:
library("MASS")
x = rnorm(10000)
fitdistr(x[x > 0], densfun="normal")
or just the values in the interval (0,2):
fitdistr(x[x > 0 & x < 2], densfun="normal")
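Note that for the normal family the maximum-likelihood estimate of the mean is just the sample mean of whatever data you pass in, so fitdistr here returns the mean of the truncated sample (a quick check):
fitdistr(x[x > 0], densfun="normal")$estimate["mean"]
mean(x[x > 0]) # essentially identical: the normal MLE of the mean is the sample mean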

Let's say your cutpoints are -1, 0, 1, and 2 and you are interested in the mean of sections simulating a standard Normal.
samp <- rnorm(1e5)
(res <- tapply(samp, findInterval(samp, c( -1, 0, 1, 2)), mean) )
# 0 1 2 3 4
#-1.5164151 -0.4585519 0.4608587 1.3836470 2.3824633
Please do note that the labeling could be improved. One improvement could be:
names(res) <- paste("[", c(-Inf, -1, 0, 1, 2, Inf)[-6], " , ",
c(-Inf, -1, 0, 1, 2, Inf)[-1], ")", sep="")
> res
[-Inf , -1) [-1 , 0) [0 , 1) [1 , 2) [2 , Inf)
-1.5278185 -0.4623743 0.4621885 1.3834442 2.3835116

Using the distrEx and distr packages:
library(distrEx)
E(Truncate(Norm(mean=0, sd=1), lower=0, upper=Inf))
# [1] 0.797884
(See vignette(distr) in the distrDoc package for an excellent overview of the suite of distr and related packages.)
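The same one-liner covers the two halves from the original question, and any other slice, by changing the truncation bounds (a quick sketch using the API shown above):
E(Truncate(Norm(mean=0, sd=1), lower=-Inf, upper=0))
# approximately -0.7979
E(Truncate(Norm(mean=0, sd=1), lower=qnorm(0.2), upper=qnorm(0.4)))
# approximately -0.5319, matching means.3[2] in the combined answer above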
Or, using just base R, here's an alternative that constructs a discrete approximation of the expectation within the interval between lb and ub. The bases of the approximating rectangles are adjusted so that they all have equal areas (i.e. so that the probability of a point falling in each one of them is identical).
intervalMean <- function(lb, ub, n=1e5, ...) {
## Get x-values at n evenly-spaced quantiles between lower and upper bounds
xx <- qnorm(seq(pnorm(lb, ...), pnorm(ub, ...), length = n), ...)
## Calculate expectation
mean(xx[is.finite(xx)])
}
## Your example
intervalMean(lb=0, ub=1)
# [1] 0.4598626
## The mean of the complete normal distribution
intervalMean(-Inf, Inf)
## [1] -6.141351e-17
## Right half of standard normal distribution
intervalMean(lb=0, ub=Inf)
# [1] 0.7978606
## Right half of normal distribution with mean 0 and standard deviation 100
intervalMean(lb=0, ub=Inf, mean=0, sd=100)
# [1] 79.78606
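As a check, the right-half case has a closed form: for a N(0, sd^2) distribution, the mean of the right half is sd * sqrt(2/pi), so the last two results can be compared with the exact values:
sqrt(2/pi)       # 0.7978846, exact mean of the right half of the standard normal
100 * sqrt(2/pi) # 79.78846, exact value for sd = 100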

Very slow double integrals with built-in integration or cubature, wrong result with prac2d in R

I have a question concerning the computation of a double integral in R. Maybe it is not the best package for numerical integration, but we rely heavily on its stochastic optimisation packages (the function to be optimised is very non-trivial, with lots of local minima), so we cannot switch to MATLAB or other tools.
The problem is the following: it takes a whale of a time to compute the double integral using nested integrate functions, and several times longer (!) using the hcubature approach from the cubature package. I tried the first solution from this answer (using hcubature from the cubature package), but it made the timing even worse; besides that, infinite integration limits are not supported, and the integration already chokes on the (-100, 100) interval. With the second solution (quad2d from the pracma package), the timing is great, but the result is way off!
The single integral is computed quite quickly (e.g., if the double integrals are commented out, it takes only 0.2 seconds to compute the value of the function, which is tolerable).
Here is a heavily simplified version of the function for the MWE (just to illustrate the point of integration).
library(cubature)
library(pracma)
# Generate some artificial data to try this function on
set.seed(100)
n <- 200
r <- rnorm(n, 0.0004, 0.01)
# Log-likelihood function accepts 3 parameters:
# [1] shape of positive shocks, [2] shape of negative shocks, [3] DoF of Student's distribution for jumps
parm <- c(6, 7, 10)
LL <- function(parm, cub = "default") {
shapes <- parm[1:2]
studdof <- parm[3]
# For simplification, generate some dynamic series
set.seed(101)
sigmaeps <- rgamma(n, shape=shapes[1], rate=1000)
sigmaeta <- rgamma(n, shape=shapes[2], rate=1000)
lambdas <- rgamma(n, shape=10, rate=80)+1
probs <- sapply(lambdas, function(x) dpois(0:2, lambda=x))
probs <- sweep(probs, 2, colSums(probs), FUN="/") # Normalising the probabilities
# Reserving memory for 3 series of density
fw0 <- rep(NA, n)
fw1 <- rep(NA, n)
fw2 <- rep(NA, n)
for (t in 2:n) {
integ0 <- function(e) { # First integrand for 0 jumps
1/sigmaeta[t] * dgamma(-(r[t]-sigmaeps[t]*e)/sigmaeta[t], shape=shapes[2]) * # Density of negative shocks
dgamma(e, shape=shapes[1]) # Density of positive shocks
}
integ1 <- function(e, g) { # Double integrand for 1 jump
1/sigmaeta[t] * dgamma(-(r[t]-sigmaeps[t]*e-1*g)/sigmaeta[t], shape=shapes[2]) * # Density of negative shocks
dgamma(e, shape=shapes[1]) * # Density of positive shocks
dt(g, df = studdof)/1 # Density of jump intensity
}
integ2 <- function(e, g) { # Double integrand for 2 jumps
1/sigmaeta[t] * dgamma(-(r[t]-sigmaeps[t]*e-2*g)/sigmaeta[t], shape=shapes[2]) * # Density of negative shocks
dgamma(e, shape=shapes[1]) * # Density of positive shocks
dt(g, df = studdof)/2 # Density of jump intensity
}
# Wrappers for cubature because they need vector inputs
wrapper1 <- function(x) integ1(x[1], x[2])
wrapper2 <- function(x) integ2(x[1], x[2])
# Single integral that is not a problem
fw0[t] <- integrate(integ0, 0, Inf)$value
if (cub=="cubature") {
# 2D CUBATURE FROM cubature PACKAGE
fw1[t] <- hcubature(wrapper1, c(0, -20), c(20, 20))$integral
fw2[t] <- hcubature(wrapper2, c(0, -20), c(20, 20))$integral
} else if (cub=="prac2d") {
# 2D CUBATURE FROM pracma PACKAGE
fw1[t] <- quad2d(integ1, 0, 100, -100, 100)
fw2[t] <- quad2d(integ2, 0, 100, -100, 100)
} else if (cub=="default") {
# DOUBLE INTEGRALS FROM BUILT-IN INTEGRATE
fw1[t] <- integrate(function(g) { sapply(g, function(g) { integrate(function(e) integ1(e, g), 0, Inf)$value }) }, -Inf, Inf)$value
fw2[t] <- integrate(function(g) { sapply(g, function(g) { integrate(function(e) integ2(e, g), 0, Inf)$value }) }, -Inf, Inf)$value
}
if (!t%%10) print(t)
}
fw <- fw0*probs[1, ] + fw1*probs[2, ] + fw2*probs[3, ]
fw <- log(fw[2:n])
fw[is.nan(fw)] <- -Inf
slfw <- sum(fw)
print(paste0("Point: ", paste(formatC(parm, 4, format="e", digits=3), collapse=" "), ", LL: ", round(slfw, 2)))
return(slfw)
}
system.time(LL(parm, cub="default"))
# 13 seconds
# "Point: 6.000e+00 7.000e+00 1.000e+01, LL: 247.78"
system.time(LL(parm, cub="cubature"))
# 29 seconds, the result is slightly off
# "Point: 6.000e+00 7.000e+00 1.000e+01, LL: 241.7"
system.time(LL(parm, cub="prac2d"))
# 0.5 seconds, the result is way off
# "Point: 6.000e+00 7.000e+00 1.000e+01, LL: 223.25"
(Ideally, integ1(e, g) and integ2(e, g) should be integrated over [0, Inf) w.r.t. e and over (-Inf, Inf) w.r.t. g.)
Parallelisation is done at a higher level (i.e., the stochastic optimiser is computing the values of this likelihood function in parallel), so it is essential that this function run as quickly as possible on a single core.
Is there any way to speed up the computation of this double integral?
Here is a wrapper for hcubature which I use to allow infinite limits:
hcubature.inf <- function() {
cl <- match.call()
cl[[1L]] <- quote(cubature::hcubature)
if(all(is.finite(c(lowerLimit,upperLimit)))) return(eval.parent(cl))
# convert limits to new coordinates to incorporate infinities
cl[['upperLimit']] <- atan(upperLimit)
cl[['lowerLimit']] <- atan(lowerLimit)
# wrap the function with the coordinate transformation
# update argument to hcubature with our function
f <- match.fun(f)
cl[['f']] <- if(!vectorInterface)
function(x, ...) f(tan(x), ...) / prod(cos(x))^2
else
function(x, ...) f(tan(x), ...) / rep(apply(cos(x), 2, prod)^2, each=fDim)
eval.parent(cl)
}
formals(hcubature.inf) <- formals(cubature::hcubature)
Then you should vectorize the integrands:
vwrapper1 <- function(x) as.matrix(integ1(x[1,], x[2,]))
vwrapper2 <- function(x) as.matrix(integ2(x[1,], x[2,]))
And integrate:
if (cub=="cubature.inf") {
fw1[t] <- hcubature.inf(vwrapper1, c(0, -Inf), c(Inf, Inf), vectorInterface=TRUE)$integral
fw2[t] <- hcubature.inf(vwrapper2, c(0, -Inf), c(Inf, Inf), vectorInterface=TRUE)$integral
} else if (cub=="cubature") {
...
You get a value of 242.83 in about half the time of your default method.
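As a standalone sanity check of the wrapper (my own toy integrand, unrelated to the likelihood), integrating the product of two standard normal densities over the whole plane should give a value very close to 1:
hcubature.inf(function(x) dnorm(x[1]) * dnorm(x[2]), lowerLimit=c(-Inf, -Inf), upperLimit=c(Inf, Inf))$integral
# should be very close to 1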

Using optimize() to find the shortest interval that covers 95% of the area under a curve in R

Background:
I have a curve whose Y-values are produced by my small R function below (neatly annotated). If you run my entire R code, you see my curve (but remember, it's a function, so if I changed the argument values, I could get a different curve).
Question:
Obviously, one can determine/assume many intervals that would cover 95% of the total area under this curve. But using optimize(), how can I find the SHORTEST (in x-value units) of these many possible 95% intervals? What then would be the corresponding x-values for the two ends of this shortest 95% interval?
Note: The idea of a shortest interval makes sense for a uni-modal curve like mine. In practice, the shortest interval tends to sit toward the middle, where the height (y-value) is larger, so the x-range does not need to be as wide for the interval to cover 95% of the total area under the curve.
Here is my R code (please run the entire code):
ppp <- function(f, N, df1, df2, petasq, alpha, beta) {
pp <- function(petasq) dbeta(petasq, alpha, beta)
ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq) )
marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]
po <- function(x) pp(x)*ll(x) / marg
return(po(petasq) )
}
## ### END OF MY R FUNCTION.
# Now I use my function above to get the y-values for my plot:
petasq <- seq(0, 1, by = .0001) ## These are X-values for my plot
f <- 30 # a function needed argument
df1 <- 3 # a function needed argument
df2 <- 108 # a function needed argument
N <- 120 # a function needed argument
alpha = 5 # a function needed argument
beta = 4 # a function needed argument
## Now use the ppp() function to get the Y-values for the X-value range above:
y.values <- ppp(f, N, df1, df2, petasq, alpha, beta)
## Finally plot petasq (as X-values) against the Y.values:
plot(petasq, y.values, ty="l", lwd = 3 )
Based on your revised question, here is an optimization that minimizes the distance (in x-value units) between the LEFT and RIGHT boundaries:
ppp <- function(petasq, f, N, df1, df2, alpha, beta) {
pp <- function(petasq) dbeta(petasq, alpha, beta)
ll <- function(petasq) df(f, df1, df2, (petasq * N) / (1 - petasq) )
marg <- integrate(function(x) pp(x)*ll(x), 0, 1)[[1]]
po <- function(x) pp(x)*ll(x) / marg
return(po(petasq) )
}
petasq <- seq(0, 1, by = .0001) ## These are X-values for my plot
f <- 30 # a function needed argument
df1 <- 3 # a function needed argument
df2 <- 108 # a function needed argument
N <- 120 # a function needed argument
alpha = 5 # a function needed argument
beta = 4 # a function needed argument
optim_func <- function(x_left) {
int_function <- function(petasq) {
ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
}
# For every LEFT value, find the corresponding RIGHT value that gives 95% area.
find_95_right <- function(x_right) {
(0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
}
x_right_obj <- optimize(f=find_95_right, interval=c(0.5,1))
if(x_right_obj$objective > .Machine$double.eps^0.25) return(100)
#Return the DISTANCE BETWEEN LEFT AND RIGHT
return(x_right_obj$minimum - x_left)
}
#MINIMIZE THE DISTANCE BETWEEN LEFT AND RIGHT
x_left <- optimize(f=optim_func, interval=c(0.30,0.40))$minimum
find_95_right <- function(x_right) {
(0.95 - integrate(int_function, lower=x_left, upper=x_right, subdivisions = 10000)$value)^2
}
int_function <- function(petasq) {
ppp(petasq, f=f, N=N, df1=df1, df2=df2, alpha=alpha, beta=beta)
}
x_right <- optimize(f=find_95_right, interval=c(0.5,1))$minimum
See the comments in the code. Hopefully this finally satisfies your question :) Results:
> x_right
[1] 0.5409488
> x_left
[1] 0.3201584
Also, you can plot the distance between LEFT and RIGHT as a function of the left boundary:
left_x_values <- seq(0.30, 0.335, 0.0001)
DISTANCE <- sapply(left_x_values, optim_func)
plot(left_x_values, DISTANCE, type="l")
If we think of this as trying to find the shortest interval that accumulates the target area, we can start by calculating the area of each of the small regions we are plotting. We can then find the largest area (which presumably will be near the center) and keep walking outward until we have accumulated the area we are looking for.
Since you've already calculated the x and y values for the plot, I'll reuse those to save some calculations. Here's an implementation of that algorithm:
pseduoarea <- function(x, y, target=.95) {
dx <- diff(x)
areas <- dx * .5 * (head(y,-1) + tail(y, -1))
peak <- which.max(areas)
range <- c(peak, peak)
found <- areas[peak]
while(found < target) {
if(areas[range[1]-1] > areas[range[2]+1]) {
range[1] <- range[1]-1
found <- found + areas[range[1]-1]
} else {
range[2] <- range[2]+1
found <- found + areas[range[2]+1]
}
}
val<-x[range]
attr(val, "indexes")<-range
attr(val, "area")<-found
return(val)
}
And we call it with
pseduoarea(petasq, y.values)
# [1] 0.3194 0.5413
This does assume that all the values in petasq are equally spaced.
I don't think you need to use optimize (unless this were part of an unadmitted homework assignment). Instead just normalize a cumulative sum and figure out at which points your criteria are met:
cusm.y <- cumsum(y.values)/sum(y.values) # the normalized cumulative sum referred to above
> which(cusm.y >= 0.025)[1]
[1] 3163
> which(cusm.y >= 0.975)[1]
[1] 5375
You can check that these are reasonable indices to use for the pulling values from the petasq vector with:
abline( v= c( petasq[ c( which(cusm.y >= 0.025)[1], which(cusm.y >= 0.975)[1])]),
col="red")
This is admittedly equivalent to constructing an integration function with a normalization constant across the domain of the "density" function. Because the x-intervals are all of equal width, the differencing of the x-vector can be omitted from the height-times-base calculation.
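If you do want the shortest 95% interval rather than the equal-tailed one, the same normalized cumulative sum can be scanned directly: for every feasible left endpoint, take the first right endpoint at which the accumulated fraction reaches 95%, and keep the narrowest pair. A grid-search sketch using the petasq and y.values objects defined in the question:
cusm.y <- cumsum(y.values)/sum(y.values)
lefts <- which(cusm.y <= 0.05) # feasible left endpoints
rights <- sapply(lefts, function(i) which(cusm.y - cusm.y[i] >= 0.95)[1]) # first index reaching 95% area
widths <- petasq[rights] - petasq[lefts]
best <- which.min(widths)
c(lower=petasq[lefts[best]], upper=petasq[rights[best]])
# should land near 0.32 and 0.54, in line with the optimize() and pseduoarea() results above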
I suppose another interpretation is possible. That would require discovering how many of the y-values, sorted in descending order, are needed to sum to 95% of the total sum. This gives a different strategy, and the plot shows where a horizontal line would intersect the curve:
which( cumsum( sort( y.values, decreasing=TRUE) ) > 0.95* sum(y.values, na.rm=TRUE) )[1]
#[1] 2208
sort( y.values, decreasing=TRUE)[2208]
#[1] 1.059978
png()
plot(petasq, y.values, ty="l", lwd = 3 )
abline( h=sort( y.values, decreasing=TRUE)[2208], col="blue")
dev.off()
To get the petasq values you would need to determine the first y.values that exceeded that value and then the next y.values that dropped below that level. These can be obtained via:
order(y.values, decreasing=TRUE)[2208]
#[1] 3202
order(y.values, decreasing=TRUE)[2209]
#[1] 5410
And then the plot would look like:
png(); plot(petasq, y.values, ty="l", lwd = 3 )
abline( v= petasq[ c(3202, 5410)], col="blue", lty=3, lwd=2)
dev.off()
The area between the two dotted blue lines is 95% of the total area above the zero line.

Solving a double integral associated with a Multivariate Normal density

I am trying to solve a double integral associated with the Multivariate Normal density with known mean vector and covariance matrix:
library(cubature)
mu1 <- matrix(c(3,3), nrow=2)
sigma1 <- rbind(c(4,-1), c(-1,6))
quadratic <- function(a,b) {
X <- matrix(c(a,b),nrow=2)
Q <- (-1/2)*t(X-mu1)%*%solve(sigma1)%*%(X-mu1)
}
NormalPDF <- function(x1,x2) {
f <- (1/(2*pi))*(1/sqrt(det(sigma1)))*exp(quadratic(x1,x2))
}
# Solving for P(1 < X1 < 3, 1 < X2 < 3)
P <- adaptIntegrate(NormalPDF(x1,x2), c(1,3), c(1,3))
However, it keeps giving me the error:
Error in matrix(c(a, b), nrow = 2) : object 'x1' not found
Is there any obvious error with my code?
HubertL has pointed out that the first argument should be a function, not a function call with arguments. It is assumed that the function will accept an "x" argument, a single length-2 vector, so the NormalPDF function needs to be modified both in its arguments and in its call to the helper function. Another error was in how the limits are set up.
Consider this:
library(cubature)
mu1 <- matrix(c(3,3), nrow=2)
sigma1 <- rbind(c(4,-1), c(-1,6))
quadratic <- function(a,b) {
X <- matrix(c(a,b),nrow=2)
Q <- (-1/2)*t(X-mu1)%*%solve(sigma1)%*%(X-mu1)
}
NormalPDF <- function(x) {
f <- (1/(2*pi))*(1/sqrt(det(sigma1)))*exp(quadratic(x[1],x[2]))
}
# Solving for P(1 < X1 < 3, 1 < X2 < 3)
P <- adaptIntegrate( NormalPDF, lowerLimit= c(1,1), upperLimit=c(3,3))
P
#==============
$integral
[1] 0.09737084
$error
[1] 1.131395e-08
$functionEvaluations
[1] 17
$returnCode
[1] 0
This integrates the density over the square with "lower left hand corner" at (1,1) and "upper right corner" at (3,3). The invocation in the question would always have returned 0, since its limits describe a single point. The value needs to be extracted from the list with P$integral if you want to do anything "numerical" with it. It seems reasonable that the result is less than 0.25, since we are only evaluating over part of the quarter-plane that extends down and to the left from the maximum at (3,3).
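As an independent cross-check, the same rectangle probability can be computed with pmvnorm from the mvtnorm package (an extra dependency, not used in the question):
library(mvtnorm)
pmvnorm(lower=c(1, 1), upper=c(3, 3), mean=c(3, 3), sigma=sigma1)
# approximately 0.0974, in agreement with the cubature result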

Normalization function in R

I have a matrix I want to transform, such that every feature in the transformed dataset has mean of 0 and variance of 1.
I have tried to use the following code:
scale <- function(train, test)
{
trainmean <- mean(train)
trainstd <- sd(train)
xout <- test
for (i in 1:length(train[1,])) {
xout[,i] = xout[,i] - trainmean(i)
}
for (i in 1:lenght(train[1,])) {
xout[,i] = xout[,i]/trainstd[i]
}
}
invisible(xout)
normalized <- scale(train, test)
This is, however, not working for me. Am I on the right track?
Edit: I am very new to the syntax!
You can use the built-in scale function for this.
Here's an example, where we fill a matrix with random uniform variates between 0 and 1 and centre and scale them to have 0 mean and unit standard deviation:
m <- matrix(runif(1000), ncol=4)
m_scl <- scale(m)
Confirm that the column means are 0 (within tolerance) and their standard deviations are 1:
colMeans(m_scl)
# [1] -1.549004e-16 -2.490889e-17 -6.369905e-18 -1.706621e-17
apply(m_scl, 2, sd)
# [1] 1 1 1 1
See ?scale for more details.
To write your own normalisation function, you could use:
my_scale <- function(x) {
apply(x, 2, function(col) {
(col - mean(col))/sd(col)
})
}
m_scl <- my_scale(m)
or the following, which is probably faster on larger matrices
my_scale <- function(x) sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), '/')
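Either hand-rolled version can be checked the same way as the built-in scale:
m_scl2 <- my_scale(m)
colMeans(m_scl2)     # all approximately 0
apply(m_scl2, 2, sd) # all 1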
Just suggesting another hand-written normalizing function that avoids apply, which in my experience is slower than matrix computation:
m = matrix(rnorm(5000, 2, 3), 50, 100)
m_centred = m - m%*%rep(1,dim(m)[2])%*%rep(1, dim(m)[2])/dim(m)[2]                       # subtract row means
m_norm = m_centred/sqrt(m_centred^2%*%rep(1,dim(m)[2])/(dim(m)[2]-1))%*%rep(1,dim(m)[2]) # divide by row standard deviations
## Verification
rowMeans(m_norm)       # all approximately 0
apply(m_norm, 1, sd)   # all 1
(Note that rows, not columns, are normalized here.)

Setting upper and lower limits in rnorm

I am simulating data using rnorm, but I need to set an upper and lower limit, does anyone know how to do this?
code:
rnorm(n = 10, mean = 39.74, sd = 25.09)
The upper limit needs to be 340, and the lower limit 0.
I am asking this question because I am rewriting SAS code in R. I have never used SAS.
I am trying to rewrite the following piece of code:
sim_sample(simtot=100000,seed=10004,lbound=0,ubound=340,round_y=0.01,round_m=0.01,round_sd=0.01,n=15,m=39.74,sd=25.11,mk=4)
The rtruncnorm() function will return the results you need.
library(truncnorm)
rtruncnorm(n=10, a=0, b=340, mean=39.74, sd=25.09)
You can quite simply make your own truncated normal sampler that doesn't require you to throw out observations:
rtnorm <- function(n, mean, sd, a = -Inf, b = Inf){
qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
}
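For example, with the numbers from the question:
set.seed(1)
rtnorm(10, mean=39.74, sd=25.09, a=0, b=340)
# 10 draws, all guaranteed to fall within (0, 340)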
Like this?
mysamp <- function(n, m, s, lwr, upr, nnorm) {
samp <- rnorm(nnorm, m, s)
samp <- samp[samp >= lwr & samp <= upr]
if (length(samp) >= n) {
return(sample(samp, n))
}
stop(simpleError("Not enough values to sample from. Try increasing nnorm."))
}
set.seed(42)
mysamp(n=10, m=39.74, s=25.09, lwr=0, upr=340, nnorm=1000)
#[1] 58.90437 38.72318 19.64453 20.24153 39.41130 12.80199 59.88558 30.88578 19.66092 32.46025
However, the result is not normally distributed and usually won't have the mean and sd you've specified (in particular if the limits are not symmetric around the specified mean).
Edit:
According to your comment it seems you want to translate this SAS function. I am not an SAS user, but this should do more or less the same:
mysamp <- function(n, m, s, lwr, upr, rounding) {
samp <- round(rnorm(n, m, s), rounding)
samp[samp < lwr] <- lwr
samp[samp > upr] <- upr
samp
}
set.seed(8)
mysamp(n=10, m=39.74, s=25.09, lwr=0, upr=340, rounding=3)
#[1] 37.618 60.826 28.111 25.920 58.207 37.033 35.467 12.434 0.000 24.857
You may then want to use replicate to run the simulations. Or if you want faster code:
sim <- matrix(mysamp(n=10*10, m=39.74, s=25.09, lwr=0, upr=340, rounding=3), 10)
means <- colMeans(sim)
sds <- apply(sim, 2, sd)
Assuming you want exactly 10 numbers and not just the subset of them that is > 0 and < 340 (and the result might not be a normal distribution):
aa <- rnorm(n = 10, mean = 39.74, sd = 25.09)
while(any(aa < 0 | aa > 340)) { aa <- rnorm(n = 10, mean = 39.74, sd = 25.09) }
This is the function that I wrote to achieve the same purpose. It normalizes the result from the rnorm function and then adjusts it to fit the range.
NOTE: The standard deviation and mean (if specified) get altered during the normalization process.
#' Creates a random normal distribution within the specified bounds.
#'
#' WARNING: This function does not preserve the standard deviation or mean.
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean=NA, sd=1, lower=-1, upper=1){
mean = ifelse(is.na(mean) || mean < lower || mean > upper,
mean(c(lower, upper)), mean)
data <- rnorm(n, mean=mean, sd=sd) # draw from the untruncated normal
if (!is.na(lower) && !is.na(upper)){ # adjust data to specified range
drange <- range(data) # data range
irange <- range(lower, upper) # input range
data <- (data - drange[1])/(drange[2] - drange[1]) # normalize data (make it 0 to 1)
data <- (data * (irange[2] - irange[1]))+irange[1] # adjust to specified range
}
return(data)
}
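For example (note that, as warned above, the rescaling changes the mean and sd):
set.seed(1)
x <- rtnorm(1000, mean=39.74, sd=25.09, lower=0, upper=340)
range(x) # exactly 0 and 340, because the sample is rescaled to the limits
mean(x)  # generally far from 39.74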
There are several ways to set upper and lower limits on a normal distribution, all of which cause the result to no longer be normally distributed.
Assume mean=0 and sd=1, producing N=1e5 values with a lower boundary of LO=-1 and an upper boundary of UP=2.
N <- 1e5L
LO <- -1
UP <- 2
Move outliers to border (#Roland)
set.seed(42)
x <- pmax(LO, pmin(UP, rnorm(N)))
mean(x)
#[1] 0.07238029
median(x)
#[1] -0.002066374
sd(x)
#[1] 0.8457605
hist(x, 30)
Cut outliers off (#Dason, #Roland, truncnorm::rtruncnorm, MCMCglmm::rtnorm)
set.seed(42)
x <- qnorm(runif(N, pnorm(LO), pnorm(UP)))
mean(x)
#[1] 0.2317875
median(x)
#[1] 0.173679
sd(x)
#[1] 0.7236536
Scale (#Alex Essilfie)
set.seed(42)
x <- rnorm(N)
x <- (x-min(x))/(max(x)-min(x))*(UP-LO)+LO
mean(x)
#[1] 0.4474876
median(x)
#[1] 0.4482257
sd(x)
#[1] 0.3595199
Combination of methods. E.g. Cut and scale:
set.seed(42)
x <- qnorm(runif(N, pnorm(-3), pnorm(3)))
x <- (x-min(x))/(max(x)-min(x))*(UP-LO)+LO
mean(x)
#[1] 0.5010759
median(x)
#[1] 0.5014713
sd(x)
#[1] 0.4957751
Asymmetric combination
set.seed(42)
n <- round(N*abs(LO)/diff(range(c(LO, UP))))
x <- c(qnorm(runif(n, pnorm(-3), 0.5)), qnorm(runif(N-n, 0.5, pnorm(3))))
x <- ifelse(x < 0, x/min(x)*LO, x/max(x)*UP)
mean(x)
#[1] 0.2651627
median(x)
#[1] 0.2127903
sd(x)
#[1] 0.5078264
