Multiple random values between specific ranges in R?

I want to draw 50 samples from a (TRUNCATED) normal (Gaussian) distribution in the range 15-85, with mean = 35 and sd = 30. For reproducibility:
num = 50 # number of samples
rng = c(15, 85) # the range to pick the samples from
mu = 35 # mean
std = 30 # standard deviation
The following code gives 50 samples:
rnorm(n = num, mean = mu, sd = std)
However, I want these numbers to be strictly between the range 15-85. How can I achieve this?
UPDATE: Some people made great points in the comment section that this problem cannot be solved as stated, since the result will no longer be a Gaussian distribution. I added the word TRUNCATED to the original post so it makes more sense (truncated normal distribution).

As Limey said in the comments, imposing a bounded region means the distribution is no longer normal. There are several ways to draw from the truncated distribution.
library("MCMCglmm")
rtnorm(n = 50, mean = mu, sd = std, lower = 15, upper = 85)
is one method. If you want a more manual approach, you can simulate uniform draws between the CDF values of the two bounds and then apply the normal quantile function:
bounds <- c(pnorm(15, mu, std), pnorm(85, mu, std))  # CDF values at the lower and upper bounds
samples <- qnorm(runif(50, bounds[1], bounds[2]), mu, std)
The idea is very basic: simulate the quantiles of the outcome, and then evaluate the normal quantile function at those simulated values. The advantage of this approach over the one linked by GKi is that it keeps the result "normal-ish"; simulating first and then bounding the resulting vector would put additional mass at the bounds compared to the normal distribution.
Note the outcome is not normal, as it is bounded.
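A quick sanity check (a sketch, assuming the samples vector from above): all draws should land strictly inside the requested range, and the histogram should still look roughly bell-shaped, just cut off at the bounds.
range(samples)              # both values should lie inside (15, 85)
hist(samples, breaks = 20)  # roughly bell-shaped, truncated at the bounds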

Related

Estimating PDF with monotonically declining density at tails

tldr: I am numerically estimating a PDF from simulated data and I need the density to monotonically decrease outside of the 'main' density region (as x -> infinity). What I have yields a density close to zero, but one that does not monotonically decrease.
Detailed Problem
I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability density function of some random variable (whose distribution cannot be analytically derived) at some (observed) value x. The goal is to maximize the log-likelihood of these densities, which requires them to not have spurious local maxima.
Since I do not have an analytic likelihood function, I numerically simulate the random variable by drawing the random component from some known distribution function and applying a non-linear transformation to it. I save the results of this simulation in a dataset named Simulated_stats_.
I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:
#some example simulation
Simulated_stats_ <- runif(n=500, 10,15)+ rnorm(n=500,mean = 15,sd = 3)
#approximation for x
approxfun(density(Simulated_stats_))(x)
This works well within the range of the simulated Simulated_stats_, see image:
Example PDF. The problem is that I need to be able to evaluate the PDF far from the range of the simulated data.
So in the image above, I would need to evaluate the PDF at, say, x=50:
approxfun(density(Simulated_stats_))(50)
> [1] NA
So instead I use the from and to arguments of the density() function, which correctly approximates the near-zero tails, e.g.
approxfun(
  density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
)(50)
[1] 1.924343e-18
This is great, under one condition: I need the density to go to zero the further x is from the range. That is, if I evaluate at x = 51 the result must be strictly smaller. (Otherwise my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic very far from the 'main' density mass, i.e. in the extrapolated region.)
To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted. The result is discouraging: far from the main density mass the density 'jumps' up and down, always very close to zero, but NOT monotonically decreasing.
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = function(x) {
  approxfun(
    density(Simulated_stats_, from = 0, to = max(Simulated_stats_) * 10)
  )(x)
})
aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
plot(aa[, 1], log(aa[, 2]))
Result:
Non-monotonic log density far from density mass
My question
Does this happen because of the kernel estimation in density() or is it inaccuracies in approxfun()? (or something else?)
What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?
Or - how can I manually change the approximated PDF to monotonically decline the further I am from the density mass? I would happily stick some linear trend that goes to zero...
Thanks!
One possibility is to estimate the CDF using a beta regression model; a numerical estimate of the derivative of this model can then be used to estimate the pdf at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.
Import libraries
library(mgcv)
library(data.table)
library(ggplot2)
Generate your data
set.seed(123)
Simulated_stats_ <- runif(n=5000, 10,15)+ rnorm(n=500,mean = 15,sd = 3)
Function to estimate CDF using gam beta regression model
get_mod <- function(ss, p = seq(0.02, 0.98, 0.02)) {
  qp = quantile(ss, probs = p)
  betamod = mgcv::gam(p ~ s(qp, bs = "cs"), family = mgcv::betar())
  return(betamod)
}
betamod <- get_mod(Simulated_stats_)
Very basic estimate of PDF at val given model that estimates CDF
est_pdf <- function(val, betamod, tol = 0.001) {
  xvals = c(val, val + tol)
  yvals = predict(betamod, newdata = data.frame(qp = xvals), type = "response")
  as.numeric((yvals[1] - yvals[2]) / (xvals[1] - xvals[2]))
}
Let's check whether it is monotonically increasing below the min of Simulated_stats_
test_x = seq(0,min(Simulated_stats_), length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummax(pdf))
[1] TRUE
Let's check whether it is monotonically decreasing above the max of Simulated_stats_
test_x = seq(max(Simulated_stats_), 60, length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummin(pdf))
[1] TRUE
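For example, to evaluate the estimated pdf at the point from the original question (x = 50), assuming betamod from above:
est_pdf(50, betamod)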
Additional thoughts 3/5/22
As discussed in comments, using the betamod to predict might slow down the estimator. While this could be resolved to a great extent by writing your own predict function directly, there is another possible shortcut.
Generate estimates from the betamod over the range of X, including the extremes
k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
Use the approach you were initially using, i.e. a linear interpolation, but rather than interpolating over the density() output, do it over k (i.e. over the above estimates from the beta model)
lin_int = approxfun(x=seq(0,max(Simulated_stats_)*10, length.out=5000),y=k)
You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x
c(est_pdf(38,betamod), lin_int(38))
[1] 0.001245894 0.001245968
and it is very fast
microbenchmark::microbenchmark(
  list = alist("betamod" = est_pdf(38, betamod), "lin_int" = lin_int(38)), times = 100
)
Unit: microseconds
    expr    min      lq     mean  median      uq    max neval
 betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8   100
 lin_int    1.7    2.25    3.503    4.35    4.50   10.5   100
Finally, let's check the same plot you did before, but using lin_int() instead of approxfun(density(....))
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
plot(aa[, 1], log(aa[, 2]))

How do I simulate a class of 100 students?

An exam has 20 multiple-choice questions, each answered correctly with probability P = 0.25. How do I simulate a class of 100 students, and what is the average score of the class? If the class is increased to 1000 students, what happens to the average?
I'm not sure where to begin, other than just trying to solve this manually.
n_experiments<-100
n_samples<-c(1:20)
means_of_sample_n<-c()
hist(rbinom( n = 100, size = 20, prob = 0.25 ))
I'm not sure what to do after this?
Well, you just have to find a way to set the probability of answering correctly to 0.25; you can do that easily by generating uniform random numbers:
n_experiments <- 100
n_samples <- c(1:20)
means_of_sample_n <- c()
hist(rbinom(n = 100000, size = 20, prob = 0.25))
Nstu = 100000   # number of simulated students
Nquest = 20     # number of questions per exam
Results = matrix(as.numeric(runif(Nstu * Nquest) < 0.25), ncol = Nquest)
hist(apply(Results, 1, sum))   # distribution of total scores
mean(apply(Results, 1, sum))   # average score, close to 20 * 0.25 = 5
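To answer the question's actual class sizes, the same simulation can be run with 100 and 1000 students directly (a small sketch using rbinom, where each student's total score is a Binomial(20, 0.25) draw):
scores_100  <- rbinom(100,  size = 20, prob = 0.25)
scores_1000 <- rbinom(1000, size = 20, prob = 0.25)
mean(scores_100)    # average score for a class of 100
mean(scores_1000)   # average score for a class of 1000; both hover around 20 * 0.25 = 5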
From the definition of the binomial distribution (a short R check of these numbers follows the list):
the mean is n*p, so mu = 20*0.25, giving a mean of 5; this is independent of the class size
the variance is n*p*(1-p), and the standard deviation is the usual square root of this, so sigma = sqrt(20*0.25*0.75), i.e. ~1.94
the standard error of the mean is sigma / sqrt(k), where k is your class size, so we get SEMs of 0.19 and 0.061 for class sizes of 100 and 1000 respectively
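For reference, those numbers can be reproduced directly (a quick sketch of the arithmetic above):
n <- 20; p <- 0.25
mu <- n * p                      # 5
sigma <- sqrt(n * p * (1 - p))   # ~1.94
sigma / sqrt(100)                # SEM for a class of 100, ~0.19
sigma / sqrt(1000)               # SEM for a class of 1000, ~0.061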
It's often useful to check things via simulation, and we can simulate a single class as you were doing:
x <- rbinom(100, 20, 0.25)
plot(table(x))
I'm using plot(table(x)) above instead of using hist, because this is a discrete distribution. hist is more suited to continuous distributions, while table is better for discrete distributions with a small number of distinct values.
Next, we can simulate things many times using replicate. In this case you're after the mean of the binomial draw:
y <- replicate(1000, mean(rbinom(100, 20, 0.25)))
c(mu=mean(y), se=sd(y))
This happened to give me mu = 5.002 and se = 0.201, but it will change every run. Increasing the class size to 1000, I get mu = 5.002 again, and se = 0.060. Because these are random samples from the distribution they are subject to "Monte Carlo error", but given enough replicates they should approach the analytical answers above. That said, they're close enough to the analytical results to give me confidence I've not made any silly typos.
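The class-size-1000 run mentioned above is the same one-liner with the class size changed (a quick sketch):
y1000 <- replicate(1000, mean(rbinom(1000, 20, 0.25)))
c(mu = mean(y1000), se = sd(y1000))   # should land near 5 and 0.06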

Drawing from truncated normal distribution delivers wrong standard deviation in R

I draw random numbers from a truncated normal distribution. The truncated normal distribution is supposed to have mean 100 and standard deviation 60 after truncation at 0 from the left.
I wrote an algorithm to compute the mean and sd of the normal distribution prior to truncation (mean_old and sd_old).
The function vtruncnorm gives me the desired variance of 60^2. However, when I draw random variables from the distribution, the standard deviation is around 96.
I don't understand why the sd of the random draws differs from the computed value of 60.
I tried increasing the number of draws; the sd still comes out around 96.
require(truncnorm)
mean_old = -5425.078
sd_old = 745.7254
val = rtruncnorm(10000, a=0, mean = mean_old, sd = sd_old)
sd(val)
sqrt(vtruncnorm( a=0, mean = mean_old, sd = sd_old))
OK, I did a quick test:
require(truncnorm)
val = rtruncnorm(1000000, a=7.2, mean = 0.0, sd = 1.0)
sd(val)
sqrt(vtruncnorm( a=7.2, mean = 0.0, sd = 1.0))
This is the canonical truncated Gaussian. At a = 6 the two values are very close, e.g. 0.1554233 vs 0.1548865, depending on the seed etc. At a = 7 they are systematically different, 0.1358143 vs 0.1428084 (the sampled value is smaller than the function result). I've checked against the Python implementation
import numpy as np
from scipy.stats import truncnorm
a, b = 7.0, 100.0
mean, var, skew, kurt = truncnorm.stats(a, b, moments='mvsk')
print(np.sqrt(var))
r = truncnorm.rvs(a, b, size=100000)
print(np.sqrt(np.var(r)))
and got back 0.1428083662823426, which is consistent with the R vtruncnorm result. At your a = 7.2 or so the results are even worse.
Moral of the story: at high a values, sampling from rtruncnorm has a bug. Python has the same problem as well.
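One possible workaround (a sketch using inverse-CDF sampling in the upper tail, not part of the original answer) is to map uniform draws through qnorm with lower.tail = FALSE, which stays accurate even for large a:
a <- 7.2
p_tail <- pnorm(a, lower.tail = FALSE)   # P(X > a), computed accurately in the tail
u <- runif(1000000) * p_tail             # uniform draws on (0, p_tail)
val <- qnorm(u, lower.tail = FALSE)      # inverse survival function, all values >= a
sd(val)                                  # compare with sqrt(vtruncnorm(a = a, mean = 0, sd = 1))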

How to Standardize a Column of Data in R and Get a Bell Curve Histogram to Find the Percentage That Falls Within a Range?

I have a data set and one of its columns contains random numbers ranging from 300 to 400. I'm trying to find what proportion of this column is between 320 and 350 using R. To my understanding, I need to standardize this data and create a bell curve first. I have the mean and standard deviation, but when I do (X - mean)/SD and plot a histogram of this column, it's still not a bell curve.
This is the code I tried:
myData$C1 <- (myData$C1 - C1_mean) / C1_SD
If you simply want the proportion of observations in that range, there's no need to do any standardization and you may directly use
mean(myData$C1 >= 320 & myData$C1 <= 350)
As for the standardization, it definitely doesn't create any "bell curves": it only shifts the distribution (centering) and rescales the data (dividing by the standard deviation). Other than that, the shape of the density function itself remains the same.
For instance,
x <- c(rnorm(100, mean = 300, sd = 20), rnorm(100, mean = 400, sd = 20))
mean(x >= 320 & x <= 350)
# [1] 0.065
hist(x)
hist((x - mean(x)) / sd(x))
I suspect that what you are looking for is an estimate of the true, unobserved proportion. The standardization procedure would then be applicable if you had to use tabulated values of the standard normal distribution function. However, in R we may do that without anything like that. In particular,
pnorm(350, mean = mean(x), sd = sd(x)) - pnorm(320, mean = mean(x), sd = sd(x))
# [1] 0.2091931
That's the probability P(320 <= X <= 350), where X is normally distributed with mean mean(x) and standard deviation sd(x). The figure is quite different from that above since we misspecified the underlying distribution by assuming it to be normal; it actually is a mixture of two normal distributions.
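For comparison (not in the original answer), the probability under the true 50/50 mixture of the two normals used to generate x can be computed directly; the empirical proportion above is an estimate of this quantity:
0.5 * (pnorm(350, 300, 20) - pnorm(320, 300, 20)) +
  0.5 * (pnorm(350, 400, 20) - pnorm(320, 400, 20))
# roughly 0.079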

Convert uniform draws to normal distributions with known mean and std in R

I apply the sensitivity package in R. In particular, I want to use sobolroalhs, as it uses a sampling procedure for the inputs that allows for evaluation of models with a large number of parameters. The function samples uniformly on [0,1] for all inputs. It is stated that the desired distributions need to be obtained as follows:
####################
# Test case: dealing with non-uniform distributions
x <- sobolroalhs(model = NULL, factors = 3, N = 1000, order =1, nboot=0)
# X1 follows a log-normal distribution:
x$X[,1] <- qlnorm(x$X[,1])
# X2 follows a standard normal distribution:
x$X[,2] <- qnorm(x$X[,2])
# X3 follows a gamma distribution:
x$X[,3] <- qgamma(x$X[,3],shape=0.5)
# toy example
toy <- function(x){rowSums(x)}
y <- toy(x$X)
tell(x, y)
print(x)
plot(x)
I have non-zero means and standard deviations for some input parameters that I want to sample from a normal distribution. For others, I want to sample uniformly within a defined range (e.g. [0.03, 0.07] instead of [0, 1]). I tried using built-in R functions such as
SA$X[,1] <- rnorm(1000, mean = 579, sd = 21)
but I am afraid this procedure messes up the sampling design of the package; it resulted in odd values for the sensitivity indices. Hence, I think I need to keep the uniform draws of the sobolroalhs function and use the sampled values in [0, 1] when drawing from the desired distributions (via the quantile functions, I think?). Does this make sense to anyone, and/or does anyone know how I could sample from the right distributions following the syntax from the package description?
You can specify mean and sd in qnorm. So modify lines like this:
x$X[,2] <- qnorm(x$X[,2])
to something like this:
x$X[,2] <- qnorm(x$X[,2], mean = 579, sd = 21)
Similarly, you could use the min and max parameters of qunif to get values in a given range.
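For example, for the [0.03, 0.07] range from the question (shown here on the third column purely as an illustration):
x$X[,3] <- qunif(x$X[,3], min = 0.03, max = 0.07)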
Of course, it's also possible to transform standard normals or uniforms to the ones you want using things like X <- 579 + 21*Z or Y <- 0.03 + 0.04*U, where Z is a standard normal and U is standard uniform, but for some distributions those transformations aren't so simple and using the q* functions can be easier.
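As a quick check of that equivalence (not from the original answer), the affine transform and the quantile-function call give the same values for the same uniform draws:
u <- runif(5)
all.equal(qnorm(u, mean = 579, sd = 21), 579 + 21 * qnorm(u))    # TRUE
all.equal(qunif(u, min = 0.03, max = 0.07), 0.03 + 0.04 * u)     # TRUE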
